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This  paper  documents  a  large  set  of  heretofore  unpublished  details  Collins  used  in  his  parser,  such 
that,  along  with  Collins'  thesis  (Collins,  1999),  this  paper  contains  all  information  necessary  to 
duplicate  Collins’  benchmark  residts.  Indeed,  these  as-yet-unpublished  details  account  for  an 
11%  relative  increase  in  error  from  an  implementation  including  all  details  to  a  clean-room 
implementation  of  Collins'  model.  We  also  show  a  cleaner  and  equally-well-performing  method 
for  the  handling  of  punctuation  and  conjunction,  and  reveal  certain  other  probabilistic  oddities 
about  Collins'  parser.  We  analyze  not  only  the  effect  of  the  unpublished  details,  but  also  re¬ 
analyze  the  effect  of  certain  well-known  details,  revealing  that  bilexical  dependencies  are  barely 
used  by  the  model  and  that  head  choice  is  not  nearly  as  important  to  overall  parsing  performance 
as  once  thought.  Finally,  we  perform  experiments  that  show  that  the  true  discriminative  power 
of  lexicalization  appears  to  lie  in  the  fact  that  unlexicalized  syntactic  structures  are  generated 
conditioning  on  the  head  word  and  its  part  of  speech. 

Introduction 

Michael  Collins'  parsing  models  (Collins,  1996;  Collins,  1997;  Collins,  1999)  have  been 
quite  influential  in  the  field  of  natural  language  processing.  Not  only  did  they  achieve 
new  performance  benchmarks  on  parsing  the  Penn  Treebank  (Marcus,  Santorini,  and 
Marcinkiewicz,  1993),  and  not  only  did  they  serve  as  the  basis  of  Collins'  own  future 
work  (Collins,  2000;  Collins  and  Duffy,  2002),  but  they  also  served  as  the  basis  of  im¬ 
portant  work  on  parser  selection  (Henderson  and  Brill,  1999),  an  investigation  of  corpus 
variation  and  the  effectiveness  of  bilexical  dependencies  (Gildea,  2001),  sample  selection 
(Hwa,  2001),  bootstrapping  non-English  parsers  (Hwa,  Resnik,  and  Weinberg,  2002)  and 
for  the  automatic  labeling  of  semantic  roles  and  predicate-argument  extraction  (Gildea 
and  Jurafsky,  2000;  Gildea  and  Palmer,  2002),  as  well  as  that  of  other  research  efforts. 

Recently,  in  order  to  continue  our  work  combining  word  sense  with  parsing  (Bikel, 
2000)  and  the  study  of  language-dependent  and  -independent  parsing  features  (Bikel 
and  Chiang,  2000),  we  built  a  multi-lingual  parsing  engine  that  is  capable  of  instantiating 
a  wide  variety  of  generative,  statistical  parsing  models  (Bikel,  2002). 1  As  an  appropriate 
baseline  model,  we  chose  to  instantiate  the  parameters  of  Collins'  Model  2.  This  task 
proved  more  difficult  than  it  initially  appeared.  Starting  with  Collins'  thesis  (Collins, 
1999),  we  reproduced  all  the  parameters  described,  but  did  not  achieve  nearly  the  same 
high  performance  on  the  well-established  development  test  set  of  Section  00  of  the  Penn 
Treebank.  Together  with  Collins'  thesis,  this  paper  contains  all  the  information  neces¬ 
sary  to  replicate  Collins'  parsing  results.2  Specifically,  this  paper  describes  all  the  as-yet- 
unpublished  details  and  features  of  Collins'  model,  and  some  analysis  of  the  effect  of 
these  features  with  respect  to  parsing  performance,  as  well  as  some  comparative  anal- 

*  Department  of  Computer  and  Information  Science,  3330  Walnut  Street,  Philadelphia,  PA  19104. 

1This  engine  is  publicly  available  by  visiting  http :  //www .  cis  .  upenn .  edu/~dbikel/ software  .  html. 

2  In  the  course  of  replicating  Collins'  results,  it  was  brought  to  our  attention  that  several  other  researchers 
had  also  tried  to  do  this,  and  had  also  gotten  performance  that  fell  short  of  Collins'  published  results.  For  ex¬ 
ample,  Gildea  (2001)  re-implemented  Collins'  Model  1,  but  obtained  results  with  roughly  16.7%  more  relative 
error  than  Collins'  reported  results  using  that  Model. 
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ysis  of  the  effects  of  published  features.3  In  particular,  implementing  Collins'  model 
using  only  the  published  details  causes  an  11%  increase  in  relative  error  over  Collins' 
own  published  results.  That  is,  taken  together,  all  the  unpublished  details  have  a  signifi¬ 
cant  effect  on  overall  parsing  performance.  In  addition  to  the  effects  of  the  unpublished 
details,  we  also  have  new  evidence  to  show  that  the  discriminative  power  of  Collins' 
model  does  not  lie  where  once  thought:  bilexical  dependencies  play  an  extremely  small 
role  in  Collins'  models  (Gildea,  2001),  and  head  choice  is  not  nearly  as  critical  as  once 
thought.  This  paper  also  discusses  the  rationale  for  various  parameter  choices.  In  gen¬ 
eral,  we  will  limit  our  discussion  to  Model  2,  but  we  make  occasional  reference  to  Model 
3,  as  well. 

1.  Motivation 

There  are  three  primary  motivations  for  this  work.  First,  Collins'  parsing  model  repre¬ 
sents  a  widely-used  and  -cited  parsing  model.  As  such,  if  it  is  not  desirable  to  use  it  as  a 
black  box  (it  has  only  recently  been  made  publicly  available),  then  it  should  be  possible 
to  replicate  the  model  in  full,  providing  a  necessary  consistency  among  research  efforts 
employing  it.  Careful  examination  of  its  intricacies  will  also  allow  researchers  to  devi¬ 
ate  from  the  original  model  when  they  think  it  is  warranted,  and  accurately  document 
those  deviations,  as  well  as  understand  the  implications  of  doing  so. 

The  second  point  is  related  to  the  first:  science  dictates  that  experiments  be  replica¬ 
ble,  for  this  is  the  way  we  may  test  and  validate  them.  The  work  described  here  comes  in 
the  wake  of  several  previous  efforts  to  replicate  this  particular  model,  but  this  is  the  first 
such  effort  to  provide  a  faithful  and  equally-well-performing  emulation  of  the  original. 

The  third  motivation  is  that  a  deep  understanding  of  an  existing  model — its  intrica¬ 
cies,  the  interplay  of  its  many  features — provides  the  necessary  platform  for  advance¬ 
ment  to  newer,  "better"  models.  This  is  especially  true  in  an  area  like  statistical  parsing 
that  has  seen  rapid  maturation  followed  by  a  soft  "plateau"  in  performance.  Rather 
than  simply  throwing  features  into  a  new  model  and  measuring  their  effect  in  a  crude 
way  by  using  standard  evaluation  metrics,  this  work  aims  to  provide  a  more  thorough 
understanding  of  the  nature  of  a  model's  features.  This  understanding  is  not  only  use¬ 
ful  in  its  own  right,  but  should  help  point  the  way  toward  newer  features  to  model,  or 
better  modeling  techniques,  for  we  are  in  the  best  position  for  advancement  when  we 
understand  existing  strengths  and  limitations. 

2.  Model  Overview 

The  Collins  parsing  model  decomposes  the  generation  of  a  parse  tree  into  many,  small 
steps,  using  reasonable  independence  assumptions  to  make  the  parameter  estimation 
problem  tractable.  Even  though  decoding  proceeds  bottom-up,  the  model  is  defined 
in  a  top-down  manner.  Every  nonterminal  label  in  every  tree  is  lexicalized :  the  label  is 
augmented  to  include  a  unique  head  word  (and  that  head  word's  part  of  speech)  that 
the  node  dominates.  The  lexicalized  PCFG  that  sits  behind  Model  2  has  rules  of  the 
form 

P  ►  LnPn-l  ■  •  •  LiHRi  ■  ■  ■  Rn-iRn  (1) 

where  P,  Li,  Ri  and  H  are  all  lexicalized  nonterminals,  and  P  inherits  its  lexical  head 
from  its  distinguished  head  child,  H.  In  this  generative  model,  first  P  is  generated,  then 

discovering  these  details  and  features  involved  a  great  deal  of  reverse-engineering,  and,  ultimately, 
much  discussion  with  Mike  Collins  himself  and  perusal  of  his  code.  Many  thanks  to  Mike  Collins  for  his 
generosity.  As  a  word  of  caution,  this  paper  is  exhaustive  in  its  presentation  of  all  such  details  and  features, 
and  we  cannot  guarantee  that  every  reader  will  find  every  detail  interesting. 


2 


Intricacies  of  Collins'  Parsing  Model 


Daniel  M.  Bikel 


its  head-child  H,  then  each  of  the  left-  and  right-modifying  nonterminals  are  generated 
from  the  head  outward.  The  modifying  nonterminals  L,  and  R,  are  generated  condi¬ 
tioning  on  P  and  H,  as  well  as  a  distance  metric  (based  on  what  material  intervenes 
between  the  currently-generated  modifying  nonterminal  and  H)  and  an  incremental 
subcategorization  frame  feature  (a  multiset  containing  the  arguments  of  H  that  have 
yet  to  be  generated  on  the  side  of  H  in  which  the  currently-generated  nonterminal  falls). 
Note  that  if  the  modifying  nonterminals  were  generated  completely  independently,  the 
model  would  be  very  impoverished,  but  in  actuality,  by  including  the  distance  and 
subcategorization  frame  features,  the  model  captures  a  crucial  bit  of  linguistic  reality, 
viz.,  that  words  often  have  well-defined  sets  of  complements  and  adjuncts,  occurring 
with  some  well-defined  distribution  in  the  right  hand  sides  of  a  (context-free)  rewriting 
system.  The  process  proceeds  recursively,  treating  each  newly-generated  modifier  as  a 
parent  and  then  generating  its  head  and  modifier  children;  the  process  terminates  when 
(lexicalized)  preterminals  are  generated.  As  a  way  to  guarantee  the  consistency  of  the 
model,  the  model  also  generates  two  hidden  +STOP+  nonterminals  as  the  leftmost  and 
rightmost  children  of  every  parent  (see  Figure  7  in  §4.1). 

3.  Preprocessing  training  trees 

To  the  casual  reader  of  Collins'  thesis,  it  may  not  be  immediately  apparent  that  there  are 
quite  a  few  preprocessing  steps  for  each  annotated  training  tree,  and  that  these  steps 
are  crucial  to  the  performance  of  the  parser.  We  identified  eleven  preprocessing  steps 
necessary  to  prepare  training  trees  when  using  Collins'  parsing  model.  They  are: 

1 .  pruning  of  unnecessary  nodes 

2.  adding  base  NP  nodes  (NPBs) 

3.  "repairing"  base  NPs 

4.  adding  gap  information  (applicable  to  Model  3  only) 

5.  relabeling  of  sentences  with  no  subjects  (subjectless  sentences) 

6.  removing  null  elements 

7.  raising  punctuation 

8.  identification  of  argument  nonterminals 

9.  stripping  unused  nonterminal  augmentations 

10.  "repairing"  subjectless  sentences 

11.  head-finding 

The  order  presented  above  is  not  arbitrary,  as  some  of  the  steps  depend  on  results  pro¬ 
duced  in  previous  steps.  Also,  we  have  separated  the  steps  into  their  functional  units; 
an  implementation  could  combine  steps  that  are  independent  of  one  another  (for  clar¬ 
ity,  our  implementation  does  not,  however).  Finally,  we  note  that  the  final  step,  head¬ 
finding,  is  actually  needed  by  some  of  the  previous  steps  in  certain  cases;  in  our  imple¬ 
mentation,  we  selectively  employ  a  head-finding  module  during  the  first  10  steps  where 
necessary. 
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3.1  Coordinated  phrases 

A  few  of  the  preprocessing  steps  rely  on  a  notion  of  a  coordinated  phrase.  The  condi¬ 
tions  under  which  a  phrase  is  considered  coordinated  are  slightly  more  detailed  than 
described  in  Collins'  thesis.  A  node  represents  a  coordinated  phrase  if 

•  it  has  a  non-head  child  that  is  a  coordinating  conjunction,  and 

•  that  conjunction  is  either 

-  post-head  but  non-final,  or 

-  immediately  pre-head  but  non-initial  (where  "immediately"  means  "with 
nothing  intervening  except  punctuation").4 

In  the  Penn  Treebank,  a  coordinating  conjunction  is  any  preterminal  node  with  the  label 

cc. 

This  definition  essentially  picks  out  all  phrases  where  the  head  child  is  truly  con¬ 
joined  to  some  other  phrase,  as  opposed  to  a  phrase  where,  say,  there  is  an  initial  CC 
word,  such  as  an  S  that  begins  with  the  conjunction  "But . . .". 

3.2  Pruning  of  unnecessary  nodes 

This  preprocessing  step  simply  removes  preterminals  that  should  have  little  or  no  bear¬ 
ing  on  parser  performance.  In  the  case  of  the  English  Treebank,  the  pruned  subtrees  are 
all  preterminal  subtrees  whose  root  label  is  one  of  .}.  There  are  two  reasons  to 

remove  these  types  of  subtrees  when  parsing  the  English  Treebank:  in  the  treebanking 
guidelines  (Bies,  1995),  quotation  marks  were  given  the  lowest  possible  priority,  and 
thus  cannot  be  expected  to  appear  within  constituent  boundaries  in  any  kind  of  consis¬ 
tent  way,  and  neither  of  these  types  of  preterminals — nor  any  punctuation  marks,  for 
that  matter — count  towards  the  parsing  score. 

3.3  Adding  base  NP  nodes  (NPBs) 

An  NP  is  basal  when  it  does  not  itself  dominate  an  NP;  such  NP  nodes  are  relabeled  NPB. 
More  accurately,  an  NP  is  basal  when  it  dominates  no  other  NPs  except  possessive  NPs, 
where  a  possessive  NP  is  an  NP  that  dominates  POS,  the  preterminal  possessive  marker 
for  the  Penn  Treebank.  These  possessive  NPs  are  almost  always  themselves  base  NPs, 
and  are  therefore  (almost  always)  relabeled  NPB. 

For  consistency's  sake,  when  an  NP  has  been  relabeled  as  NPB,  a  normal  NP  node  is 
often  inserted  as  a  parent  nonterminal.  This  insertion  ensures  that  NPB  nodes  are  always 
dominated  by  NP  nodes.  The  conditions  for  inserting  this  "extra"  NP  level  are  slightly 
more  detailed  than  described  in  Collins'  thesis,  however.  The  extra  NP  level  is  added  if 
one  of  the  following  conditions  holds: 

•  the  parent  of  the  NPB  is  not  an  NP 

•  the  parent  of  the  NPB  is  an  NP,  but  constitutes  a  coordinated  phrase  (see  Fig¬ 
ure  1) 

•  the  parent  of  the  NPB  is  an  NP,  but 

-  the  parent's  head  child  is  not  the  NPB  and 

4Our  positional  descriptions  here,  such  as  "post-head  but  non-final",  refer  to  positions  within  the  list  of 
immediately-dominated  children  of  the  coordinated  phrase  node,  as  opposed  to  positions  within  the  entire 
sentence. 
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1.  Coordinated  phrase  2.  Base  NPs  relabeled  3.  Extra  NP  nodes  inserted 
Figure  1 

An  NP  that  constitutes  a  coordinated  phrase. 


the  comedian  ,  Tom  Foolery 

1.  Before  extra  NP  addition 

(the  NPB  "the  comedian"  is  the 

head  child). 


NP 


Tom  Foolery 

2.  After  extra  NP  insertion. 


Figure  2 

A  non-head  NPB  child  of  NP  requires  insertion  of  extra  NP. 
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VP  VP 


the  will  to  continue 

Before  repair.  After  repair. 


Figure  3 

An  NPB  is  "repaired". 


-  the  parent  has  not  already  been  relabeled  as  an  NPB  (see  Figure  2) 5 

In  post-processing,  when  an  NPB  is  an  only  child  of  an  NP  node,  the  extra  NP  level  is 
removed  by  merging  the  two  nodes  into  a  single  NP  node,  and  all  remaining  NPB  nodes 
are  relabeled  NP. 

3.4  Repairing  base  NPs 

The  insertion  of  extra  NP  levels  above  certain  NPB  nodes  achieves  a  degree  of  consis¬ 
tency  for  NPs,  effectively  causing  the  portion  of  the  model  that  generates  children  of  NP 
nodes  to  have  less  perplexity.  Collins  appears  to  have  made  a  similar  effort  to  improve 
the  consistency  of  the  NPB  model.  NPB  nodes  that  have  sentential  nodes  as  their  final 
(rightmost)  child  are  "repaired":  the  sentential  child  is  raised  so  that  it  becomes  a  new 
right-sibling  of  the  NPB  node  (see  Figure  3). 6 

While  such  a  transformation  is  reasonable,  it  is  interesting  to  note  that  Collins' 
parser  performs  no  equivalent  detransformation  when  parsing  is  complete,  meaning 
that  when  the  parser  produces  the  "repaired"  structure  during  testing,  there  is  a  spuri¬ 
ous  NP  bracket.7 

3.5  Adding  gap  information 

The  gap  feature  is  discussed  extensively  in  chapter  7  of  Collins'  thesis,  and  is  applicable 
only  to  his  Model  3.  This  preprocessing  step  locates  every  null  element  preterminal, 
finds  its  co-indexed  WHNP  antecedent  higher  up  in  the  tree,  replaces  the  null  element 
preterminal  with  a  special  trace  tag  and  threads  the  gap  feature  in  every  nonterminal 
in  the  chain  between  the  common  ancestor  of  the  antecedent  and  the  trace.  The  only 
detail  we  would  like  to  highlight  here  is  that  an  implementation  of  this  preprocessing 
step  should  check  for  cases  where  threading  is  impossible,  such  as  when  two  filler- 
gap  dependencies  cross.  An  implementation  should  be  able  to  handle  nested  filler-gap 
dependencies,  however. 


5Only  applicable  if  relabeling  of  NPs  is  performed  using  a  pre-order  tree  traversal. 

6Collins  defines  a  sentential  node,  for  the  purposes  of  repairing  NPBs,  to  be  any  node  that  begins  with 
the  letter  S.  For  the  Penn  Treebank,  this  defines  the  set  {S,  SBAR,  SBARQ,  SINV,  SQ  }. 

7  Since,  as  mentioned  above,  the  only  time  an  NPB  is  merged  with  its  parent  is  when  it  is  the  only  child 
of  an  NP. 
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3.6  Relabeling  subjectless  sentences 

The  node  labels  of  sentences  with  no  subjects  are  transformed  from  S  to  SG.  This  step 
enables  the  parsing  model  to  be  sensitive  to  the  different  contexts  in  which  such  sub¬ 
jectless  sentences  occur  as  compared  to  normal  S  nodes,  since  the  subjectless  sentences 
are  functionally  acting  as  noun  phrases.  Collins'  example  of 

[s  [s  Flying  planes]  is  dangerous] 

illustrates  the  utility  of  this  transformation.  However,  the  conditions  under  which  an 
S  may  be  relabeled  are  not  spelled  out;  one  might  assume  that  every  S  whose  subject 
(identified  in  the  Penn  Treebank  with  the  -SB  J  function  tag)  dominates  a  null  element 
should  be  relabeled  SG.  In  actuality,  the  conditions  are  much  stricter.  An  S  is  relabeled 
to  SG  when  the  following  conditions  hold: 

•  one  of  its  children  dominates  a  null  element  child  marked  with  -SBJ 

•  its  head  child  is  a  VP 

•  no  arguments  appear  prior  to  the  head  child  (see  §§3.9  and  3.11) 

The  latter  two  conditions  appear  to  be  an  effort  to  capture  only  those  subjectless  sen¬ 
tences  that  are  based  around  gerunds,  as  in  the  "flying  planes"  example.8 

3.7  Removing  null  elements 

This  step  simply  involves  pruning  the  tree  to  eliminate  any  subtree  that  only  dominates 
null  elements.  The  special  trace  tag  that  is  inserted  in  the  step  that  adds  gap  information 
(§3.5)  is  excluded,  as  it  is  specifically  chosen  to  be  something  other  than  the  null  element 
preterminal  marker  (which  is  -NONE-  in  the  Penn  Treebank). 

3.8  Raising  punctuation 

This  step  is  discussed  in  detail  in  chapter  7  of  Collins'  thesis.  The  main  idea  is  to  raise 
punctuation — which  is  any  preterminal  subtree  where  the  part  of  speech  is  either  a 
comma  or  a  colon — to  the  highest  possible  point  in  the  tree,  so  that  it  always  sits  be¬ 
tween  two  other  nonterminals.  Punctuation  that  occurs  at  the  very  beginning  or  end 
of  a  sentence  is  "raised  away",  i.e.,  pruned.  In  addition,  any  implementation  of  this 
step  should  handle  the  case  where  multiple  punctuation  elements  appear  as  the  initial 
or  final  children  of  some  node,  as  well  as  the  more  pathological  case  where  multiple 
punctuation  elements  appear  along  the  left  or  right  frontier  of  a  subtree  (see  Figure  4). 
Finally,  it  is  not  clear  what  to  do  with  nodes  that  only  dominate  punctuation  preter¬ 
minals.  Our  implementation  simply  issues  a  warning  in  such  cases,  and  leaves  the 
punctuation  symbols  untouched. 

3.9  Identification  of  argument  nonterminals 

Collins  employs  a  small  set  of  heuristics  to  mark  certain  nonterminals  as  arguments,  by 
appending  -A  to  the  nonterminal  label.  This  section  reveals  three  unpublished  details 
about  Collins'  argument  finding. 

•  The  published  argument-finding  rule  for  PPs  is  to  choose  the  first  nonterminal 
after  the  head  child.  In  a  large  majority  of  cases,  this  marks  the  NP  argument  of 
the  preposition.  The  actual  rule  used  is  slightly  more  complicated:  the  first  non¬ 
terminal  to  the  right  of  the  head  child  that  is  neither  PRN  nor  a  part  of  speech  tag 


8We  assume  the  "G"  in  the  label  SG  was  chosen  to  stand  for  the  word  "gerund". 
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NNP  ,  ,  jane  NNP  ,  ,  and  NNP 


John  ,  John  Jane 

Figure  4 

Raising  punctuation:  perverse  case  where  multiple  punctuation  elements  appear  along  a  frontier 
of  a  subtree. 


NP-A 


NNP 


Elizabeth 


VBN-HEAD 

i 

elected 


S-A 

I 

NP-HEAD-A 

I 

NPB 


DT  NN 

I  I 

a  director 


Figure  5 

Head  children  are  not  exempt  from  being  relabeled  as  arguments. 


is  marked  as  an  argument.  The  nonterminal  PRN  in  the  Penn  Treebank  marks 
parenthetical  expressions,  which  can  occur  fairly  often  inside  a  PP,  as  in  the 
phrase  "on  (or  above)  the  desk". 

•  Children  that  are  part  of  a  coordinated  phrase  (see  §3.1)  are  exempt  from  being 
relabeled  as  argument  nonterminals. 

•  Head  children  are  distinct  from  their  siblings  by  virtue  of  the  head-generation 
parameter  class  in  the  parsing  model.  In  spite  of  this,  Collins'  trainer  actually 
does  not  exempt  head  children  from  being  relabeled  as  arguments  (see  Figure 
5).9 

3.10  Stripping  unused  nonterminal  augmentations 

This  step  simply  involves  stripping  away  all  nonterminal  augmentations,  except  those 
that  have  been  added  from  other  preprocessing  steps  (such  as  the  -A  augmentation 
for  argument  labels).  This  includes  the  stripping  away  of  all  function  tags  and  indices 
marked  by  the  Treebank  annotators. 

9It  is  not  clear  why  this  is  done,  and  so  in  our  parsing  engine,  we  make  such  behavior  optional  via  a 
run-time  setting. 
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John 

Jane 
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NNP  CC  NNP-HEAD 


John  and  Jane 


Figure  6 

Head  moves  from  right  to  left  conjunct  in  a  coordinated  phrase,  except  when  the  parent 
nonterminal  is  NPB. 


3.11  Repairing  subjectless  sentences 

With  arguments  identified  as  described  in  §3.9,  if  a  subjectless  sentence  is  found  to  have 
an  argument  prior  to  its  head,  this  step  detransforms  the  SG  so  it  reverts  to  being  an  S. 

3.12  Head-finding 

Head-finding  is  discussed  at  length  in  Collins'  thesis,  and  the  head-finding  rules  used 
are  included  in  his  Appendix  A.  There  are  a  few  unpublished  details  worth  mentioning, 
however. 

There  is  no  head-finding  rule  for  NX  nonterminals,  so  the  default  rule  of  picking 
the  leftmost  child  is  used.10  NX  nodes  roughly  represent  the  N'  level  of  syntax,  and  in 
practice  often  denote  base  NPs.  As  such,  the  default  rule  often  picks  out  a  less-than- 
ideal  head  child,  such  as  an  adjective  that  is  the  leftmost  child  in  a  base  NP. 

Collins'  thesis  discusses  a  case  where  the  initial  head  is  modified  when  it  is  found 
to  denote  the  right  conjunct  in  a  coordinated  phrase.  That  is,  if  the  head  rules  pick  out 
a  head  that  is  preceded  by  a  CC  that  is  non-initial,  the  head  should  be  modified  to  be 
the  nonterminal  immediately  to  the  left  of  the  CC  (see  Figure  6).  An  important  detail  is 
that  such  "head  movement"  does  not  occur  inside  base  NPs.  That  is,  a  phrase  headed 
by  NPB  may  indeed  look  like  it  constitutes  a  coordinated  phrase — it  has  a  CC  that  is 
non-initial  but  to  the  left  of  the  currently-chosen  head — but  the  currently-chosen  head 
should  remain  as  is.11  As  we  shall  see,  there  is  exceptional  behavior  for  base  NPs  in 
almost  every  part  of  the  Collins  parser. 

4.  Training 

The  trainer 's  job  is  decompose  annotated  training  trees  into  a  series  of  head-  and  modifier- 
generation  steps,  recording  the  counts  of  each  of  these  steps.  Referring  to  (1),  each  H, 

10In  our  first  attempt  at  replicating  Collins'  result,  we  simply  employed  the  same  head-finding  rule  for 
NX  nodes  as  for  NP  nodes.  This  choice  yields  different — but  not  necessarily  inferior — results. 

11  In  §3.1,  we  defined  coordinated  phrases  in  terms  of  heads,  but  here  we  are  discussing  how  the  head- 
finder  itself  needs  to  determine  whether  a  phrase  is  coordinated.  It  does  this  by  considering  the  potential  new 
choice  of  head:  if  the  head-finding  rules  pick  out  a  head  that  is  preceded  by  a  non-initial  CC  ("lane"),  will 
moving  the  head  to  be  a  child  to  the  left  of  the  CC  ("John")  yield  a  coordinated  phrase?  If  so,  then  the  head 
should  be  moved — except  when  the  parent  is  NPB. 
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VP 


Figure  7 

vi  ("verb  intervening")  feature  is  true  when  generating  right-hand  +STOP+  nonterminal, 
because  the  NP  "the  will  to  continue"  contains  a  verb. 


Lf  and  Tlj  are  generated  conditioning  on  previously-generated  items,  and  each  of  these 
events  consisting  of  a  generated  item  and  some  maximal  history  context  are  counted. 
Even  with  all  this  decomposition,  sparse  data  is  still  a  problem,  and  so  each  probability 
estimate  for  some  generated  item  given  a  maximal  context  is  smoothed  with  coarser 
distributions  using  less  context,  whose  counts  are  derived  from  these  "top-level"  head- 
and  modifier-generation  counts. 

4.1  Verb  intervening 

As  mentioned  in  §2,  instead  of  generating  each  modifier  independently,  the  model  con¬ 
ditions  the  generation  of  modifiers  on  certain  aspects  of  the  history.  One  such  function 
of  the  history  is  the  distance  metric.  One  of  the  two  components  of  this  distance  metric  is 
what  we  will  call  the  "verb  intervening"  feature,  which  is  a  predicate  vi  that  is  true  if 
a  verb  has  been  generated  somewhere  in  the  surface  string  of  the  previously-generated 
modifiers  on  the  current  side  of  the  head.  For  example,  in  Figure  7,  when  generating  the 
right-hand  +STOP+  nonterminal  child  of  the  VP,  the  vi  predicate  is  true,  because  one 
of  the  previously-generated  modifiers  on  the  right  side  of  the  head  dominates  a  verb, 
"continue".12  More  formally,  the  definition  of  this  feature  is  most  easily  defined  in  terms 
of  a  recursively-defined  cv  ("contains  verb")  predicate,  which  is  true  if  and  only  if  a 
node  dominates  a  verb. 


cv(P) 


V  cv(M) 

M  child  of  P 

true 

kf alse 


if  M  is  not  a  preterminal, 

if  P  is  a  verb  preterminal, 
otherwise. 


(2) 


Referring  to  (2),  we  define  the  verb  intervening  predicate  recursively  on  the  lst-order 
Markov  process  generating  modifying  nonterminals: 


vi  (Li) 


false  if  i  <  1, 

cv(Li-i)  V  vi (Li-2)  if  i  >  1, 


(3) 


and  similarly  for  right  modifiers. 

What  is  considered  to  be  a  verb?  While  this  is  not  spelled  out,  as  it  happens,  it  is  any 
word  whose  part-of-speech  tag  is  one  of  {VB,  VBD,  VBG,  VBN,  VBP,  VBZ}.  That  is,  the  cv 

12Note  that  any  word  in  the  surface  strings  dominated  by  the  previously-generated  modifiers  will  trig¬ 
ger  the  vi  predicate.  This  is  possible  because  in  a  history-based  model  (cf.  (Black  et  al.,  1992)),  anything 
previously-generated — i.e.,  anything  in  the  history — can  appear  in  the  conditioning  context. 
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predicate  returns  true  only  for  these  preterminals  and  false  for  all  other  preterminals. 
Crucially,  this  set  omits  MD,  which  is  the  marker  for  modal  verbs.  Another  crucial  point 
about  the  vi  predicate  is  that  it  does  not  include  verbs  that  appear  within  base  NPs.  Put 
another  way,  in  order  to  emulate  Collins'  model,  we  need  to  amend  the  definition  of  cv 
by  stipulating  that  cv(NPB)  =  false. 

4.2  Skip  certain  trees 

One  oddity  of  Collins'  trainer  that  we  mention  here  for  completeness'  sake  is  that  it  skips 
certain  training  trees.  For  "odd  historical  reasons",13  the  trainer  skips  all  trees  with  more 
than  500  tokens,  where  in  this  context  a  token  is  considered  to  be  a  word,  a  nonterminal 
label  or  a  parenthesis.  This  oddity  entails  that  even  some  relatively  short  sentences  get 
skipped  because  they  have  lots  of  tree  structure.  In  the  standard  Wall  Street  Journal 
training  corpus.  Sections  02-21  of  the  Penn  Treebank,  there  are  120  such  sentences  that 
are  skipped.  Unless  there  is  something  inherently  wrong  with  these  trees,  one  would 
predict  that  adding  them  to  the  training  set  would  improve  a  parser's  performance.  As 
it  happens,  there  is  actually  a  minuscule  (and  probably  statistically  insignificant)  drop  in 
performance  (see  §7.1,  Figure  16). 

4.3  Unknown  words 

4.3.1  The  threshold  problem  Collins  mentions  in  Chapter  7  of  his  thesis  that  "[a]ll 
words  occurring  less  than  5  times  in  training  data,  and  words  in  test  data  which  have 
never  been  seen  in  training,  are  replaced  with  the  'UNKNOWN'  token."  The  frequency 
below  which  words  are  considered  unknown  is  often  called  the  unknown  word  threshold. 
Unfortunately,  this  term  can  also  refer  to  the  frequency  above  which  words  are  consid¬ 
ered  known.  As  it  happens,  the  unknown  word  threshold  Collins  uses  in  his  parser  for 
English  is  6,  not  5. 14  To  be  absolutely  unambiguous,  words  that  occur  fewer  than  6  times, 
which  is  to  say  words  that  occur  5  times  or  fewer  in  the  data  are  considered  "unknown". 


4.3.2  Not  handled  in  a  uniform  way  The  obvious  way  to  incorporate  unknown  words 
into  the  parsing  model,  then,  is  simply  to  map  all  low-frequency  words  in  the  training 
data  to  some  special  +UNKNOWN+  token  before  counting  top-level  events  for  parame¬ 
ter  estimation  (where  "low-frequency"  means  "below  the  unknown  word  threshold"). 
Collins'  trainer  actually  does  not  do  this.  Instead,  it  does  not  directly  modify  any  of 
the  words  in  the  original  training  trees,  and  proceeds  to  break  up  these  unmodified 
trees  into  the  top-level  events.  After  these  events  have  been  collected  and  counted,  the 
trainer  selectively  maps  low-frequency  words  when  deriving  counts  for  the  various  con¬ 
text  (back-off)  levels  of  the  parameters  that  make  use  of  bilexical  statistics.  If  this  map¬ 
ping  were  performed  uniformly,  then  it  would  be  identical  to  mapping  low-frequency 
words  prior  to  top-level  event  counting;  this  is  not  the  case,  however.  We  describe  the 
details  of  this  unknown-word  mapping  in  §5.8.2. 

While  there  is  a  negligible  yet  detrimental  effect  on  overall  parsing  performance 
when  one  uses  an  unknown  word  threshold  of  5  instead  of  6,  when  this  change  is  com¬ 
bined  with  the  "obvious"  method  for  handling  unknown  words,  there  is  actually  a  mi¬ 
nuscule  improvement  in  overall  parsing  performance  (see  §7.1,  Figure  16). 


13This  phrase  was  taken  from  a  comment  in  one  of  Collins'  preprocessing  Perl  scripts. 

14As  with  many  of  the  discovered  discrepancies  between  the  thesis  and  the  implementation,  we  deter¬ 
mined  the  different  unknown  word  threshold  by  reverse-engineering,  in  this  case  by  an  analysis  of  the  events 
output  by  Collins'  trainer. 
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5.  Parameter  Classes  and  Their  Estimation 

All  parameters  that  generate  trees  in  Collins'  model  are  estimates  of  conditional  prob¬ 
abilities.  Even  though  the  following  overview  of  parameter  classes  presents  only  the 
maximal  contexts  of  the  conditional  probability  estimates,  it  is  important  to  bear  in  mind 
that  the  model  always  makes  use  of  smoothed  probability  estimates  that  are  the  linear 
interpolation  of  several  raw  maximum-likelihood  estimates,  using  various  amounts  of 
context  (we  will  explore  smoothing  in  detail  in  §5.7). 

Mapped  versions  of  the  set  of  nonterminals 

In  §§3.5  and  3.9,  we  saw  how  the  raw  Treebank  nonterminal  set  is  expanded  to  in¬ 
clude  nonterminals  augmented  with  -A  and  -g.  Although  it  is  not  made  explicit  in 
Collins'  thesis,  Collins'  model  uses  two  mapping  functions  to  remove  these  augmenta¬ 
tions  when  including  nonterminals  in  the  history  contexts  of  conditional  probabilities. 
Presumably  this  was  done  to  help  alleviate  sparse  data  problems.  We  will  notate  the 
"argument  removal"  mapping  function  as  a  and  the  "gap  removal"  mapping  function 
as  7.  For  example, 

•  a(NP-A-g)  =  NP-g, 

•  7(NP-A-g)  =  NP-Aand 

•  a(7(NP-A-g))  =  NP. 

Since  gap  augmentations  are  only  present  in  Model  3,  the  7  function  effectively  is  the 
identity  function  in  the  context  of  Models  1  and  2. 

5.1  The  head  parameter  class 

The  head  nonterminal  is  generated  conditioning  on  its  parent  nonterminal  label,  as  well 
as  the  head  word  and  head  tag  which  they  share,  since  parents  inherit  their  lexical  head 
information  from  their  head  children.  More  specifically,  an  unlexicalized  head  nonter¬ 
minal  label  is  generated  conditioning  on  the  fully-lexicalized  parent  nonterminal.  We 
notate  the  parameter  class  as  follows: 

PH(H\1(P),wh,th)  (4) 


5.2  The  subcat  parameter  class 

When  the  model  generates  a  head  child  nonterminal  for  some  lexicalized  parent  non¬ 
terminal,  it  also  generates  a  kind  of  subcategorization  frame  (subcat)  on  either  side  of 
the  head  child,  with  the  following  maximal  context: 

p subcat L  ( subcat L I  a  (7  (H ) ) ,  a  (7(P) ) ,  wh ,  th )  (5) 

p subcat R  ( subcat R I  a  (7  (H ) ) ,  a  (7  (P) ) ,  wh ,  th )  (6) 

Probabilistically,  it  is  as  though  these  subcats  are  generated  with  the  head  child,  via 
application  of  the  chain  rule,  but  they  are  conditionally  independent.15  These  subcats 
may  be  thought  of  as  lists  of  requirements  on  a  particular  side  of  a  head.  For  example, 
in  Figure  8,  after  the  root  node  of  the  tree  has  been  generated  (see  §5.9),  the  head  child 
VP  is  generated,  conditioning  on  both  the  parent  label  S  and  the  head  word  of  that 

15Using  separate  steps  to  generate  subcats  on  either  side  of  the  head  not  only  allows  for  conditional  inde¬ 
pendence  between  the  left  and  right  subcats,  but  also  allows  for  these  parameters  to  be  separately  smoothed 
from  the  head-generation  parameter. 
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S(sat-VBD) 

NP-A(JohtC-NNP)^^\^(^-VBD) 
NNP(John-NNP)  VBD(sat-VBD) 
John  sat 

Figure  8 

A  fully  lexicalized  tree.  The  VP  node  is  the  head  child  of  S  . 


parent,  sat-VBD.  Before  any  modifiers  of  the  head  child  are  generated,  both  a  left-  and 
right-subcat  frame  are  generated.  In  this  case,  the  left  subcat  is  {NP-A}  and  the  right 
subcat  is  {  } ,  meaning  that  there  are  no  required  elements  to  be  generated  on  the  right 
side  of  the  head.  Subcats  do  not  specify  the  order  of  the  required  arguments.  They  are 
dynamically-updated  multisets:  when  a  requirement  has  been  generated,  it  is  removed 
from  the  multiset  and  subsequent  modifiers  are  generated  conditioning  on  the  updated 
multiset.16 

The  implementation  of  subcats  in  Collins'  parser  is  even  more  specific:  subcats  are 
multisets  containing  various  numbers  of  precisely  six  types  of  items:  NP-A,  S-A,  SBAR- 
A,  VP -A,  g  and  miscellaneous.  The  "g"  indicates  that  a  gap  must  be  generated,  and  is 
applicable  only  to  Model  3.  Miscellaneous  requirements  include  all  nonterminals  that 
were  marked  as  arguments  in  the  training  data  that  were  not  any  of  the  other  named 
types.  There  are  rules  for  determining  whether  NPs,  Ss,  SBARs  and  VPs  are  arguments, 
and  the  miscellaneous  arguments  occur  as  the  result  of  the  argument-finding  rule  for 
PPs,  which  states  that  the  first  non-PRN,  non-part-of-speech  tag  that  occurs  after  the 
head  of  a  PP  should  be  marked  as  an  argument,  and  therefore  nodes  that  are  not  one  of 
the  four  named  types  can  be  marked. 

5.3  The  modifying  nonterminal  parameter  class 

As  mentioned  above,  after  a  head  child  and  its  left  and  right  subcats  are  generated,  mod¬ 
ifiers  are  generated  from  the  head  outward,  as  indicated  by  the  modifier  nonterminal 
indices  in  Figure  1.  A  fully-lexicalized  nonterminal  has  three  components:  the  non¬ 
terminal  label,  the  head  word  and  its  part  of  speech.  The  generation  of  fully-lexicalized 
modifying  nonterminals  is  done  in  two  steps,  to  allow  for  the  parameters  to  be  indepen¬ 
dently  smoothed,  which,  in  turn,  is  done  to  avoid  sparse  data  problems.  These  two  steps 
estimate  the  joint  event  of  all  three  components  using  the  chain  rule.  First,  a  partially- 
lexicalized  version  of  the  nonterminal  is  generated,  consisting  of  the  unlexicalized  label 
plus  the  part  of  speech  of  its  head  word.  These  partially-lexicalized  modifying  nonter¬ 
minals  are  generated  conditioning  on  the  parent  label,  the  head  label,  the  head  word,  the 
head  tag,  the  current  state  of  the  dynamic  subcat  and  a  distance  metric.  Symbolically, 
the  parameter  classes  are 


PL(L(t)i  |  a(P),'y(H),  wh,  th ,  subcatL,  AL)  (7) 

p 'n(R(t) i  |  a(p) 7 (H),wh,th,  subcat R,  A. R)  (8) 


16Our  parsing  engine  allows  an  arbitrary  mechanism  for  storage  and  discharge  of  requirements:  they 
can  be  multisets,  ordered  lists,  integers  (simply  to  constrain  the  number  of  requirements),  or  any  other  mech¬ 
anism.  The  mechanism  used  is  determined  at  run-time. 
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Figure  9 

A  tree  containing  both  punctuation  and  conjunction. 

where  A  denotes  the  distance  metric.17  As  discussed  above,  one  of  the  two  components 
of  this  distance  metric  is  the  vi  ("verb  intervening")  predicate.  The  other  is  a  predicate 
that  simply  reports  whether  the  current  modifier  is  the  first  modifier  being  generated, 
that  is,  whether  i  =  1.  The  second  step  is  to  generate  the  head  word  itself,  where,  be¬ 
cause  of  the  chain  rule,  the  conditioning  context  consists  of  everything  in  the  histories 
of  equations  (7)  and  (8)  plus  the  partially-lexicalized  modifier.  As  there  are  some  inter¬ 
esting  idiosyncrasies  with  these  head-word-generation  parameters,  we  describe  them 
in  more  detail  in  §5.8. 

5.4  The  punctuation  and  coordinating  conjunction  parameter  classes 
5.4.1  Inconsistent  model  As  mentioned  in  §3.8,  punctuation  is  raised  to  the  highest 
position  in  the  tree.  This  means  that,  in  some  sense,  punctuation  acts  very  much  like 
a  coordinating  conjunction,  in  that  it  "conjoins"  the  two  siblings  between  which  it  sits. 
Observing  that  it  might  be  helpful  for  conjunctions  to  be  generated  conditioning  on  both 
of  their  conjuncts,  Collins  introduced  two  new  parameter  classes  in  his  thesis  parser, 
Ppunc  and  Pec-18 

As  per  the  definition  of  a  conjoined  phrase  in  §3.1,  conjunction  via  a  CC  node  or 
a  punctuation  node  always  occurs  post-head  (i.e.,  as  a  right  sibling  of  the  head).  Put 
another  way,  if  a  conjunction  or  punctuation  mark  occurs  pre-head,  it  is  not  generated 
via  this  mechanism.19  Furthermore,  even  if  there  is  arbitrary  material  between  the  right 
conjunct  and  the  head,  the  parameters  effectively  assume  that  the  left  conjunct  is  always 
the  head  child.  For  example,  in  Figure  9,  the  rightmost  NP  ("bushy  bushes")  is  consid¬ 
ered  to  be  conjoined  to  the  leftmost  NP  ("short  grass"),  which  is  the  head  child,  even 
though  there  is  an  intervening  NP  ("tall  trees"). 

The  new  parameters  are  incorporated  into  the  model  by  requiring  that  all  modifying 
nonterminals  be  generated  with  two  boolean  flags:  coord,  indicating  that  the  nonter¬ 
minal  is  conjoined  to  the  head  via  a  CC,  and  punc,  indicating  that  the  nonterminal  is 
conjoined  to  the  head  via  a  punctuation  mark.  When  either  or  both  of  these  flags  is  true, 
the  intervening  punctuation  or  conjunction  is  generated  via  appropriate  instances  of  the 

17Throughout  this  paper  we  will  use  the  notation  L(w,t)i  to  refer  to  the  three  items  that  comprise  a 
fully-lexicalized  left-modifying  nonterminal,  which  are  the  unlexicalized  label  Li,  its  head  word  wi,i  and 
its  part  of  speech  ,  and  similarly  for  right  modifiers.  We  use  L(t)i  to  refer  to  the  two  items  Li  and  tz,i 
of  a  partially-lexicalized  nonterminal.  Finally,  when  we  do  not  wish  to  distinguish  between  a  left  and  right 
modifier,  we  use  M(w,  t)i,  M(t)i  and  Mj. 

18Collins'  thesis  does  not  say  what  the  back-off  structure  of  these  new  parameter  classes  is,  i.e.,  how  they 
should  be  smoothed.  We  have  included  this  information  in  the  complete  smoothing  table  in  the  Appendix. 

19In  fact,  if  punctuation  occurs  before  the  head,  it  is  not  generated  at  all — a  deficiency  in  the  parsing 
model  that  appears  to  be  a  holdover  from  the  deficient  punctuation  handling  in  the  model  of  (Collins,  1997). 
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Figure  10 

The  Collins  model  assigns  equal  probability  to  these  three  trees. 


NPB  ,  CC  ,  AD  VP. 


Ppunc/Pcc  parameter  classes. 

For  example,  the  model  generates  the  five  children  in  Figure  9  in  the  following 
order:  first,  the  head  child  is  generated,  which  is  the  leftmost  NP  ("short  grass"),  con¬ 
ditioning  on  the  parent  label  and  the  head  word  and  tag.  Then,  since  modifiers  are 
always  generated  from  the  head  outward,  the  right  sibling  of  the  head,  which  is  the 
"tall  trees"  NP,  is  generated  with  both  the  punc  and  CC  flags  false.  Then,  the  right¬ 
most  NP  ("bushy  bushes")  is  generated  with  both  the  punc  and  CC  booleans  true, 
since  it  is  considered  to  be  conjoined  to  the  head  child  and  requires  the  generation  of 
an  intervening  punctuation  mark  and  conjunction.  Finally,  the  intervening  punctuation 
is  generated  conditioning  on  the  parent,  the  head  and  the  right  conjunct,  including  the 
head  words  of  the  two  conjoined  phrases,  and  the  intervening  CC  is  similarly  generated. 
A  simplified  version  of  the  probability  of  generating  all  these  children  is  summarized 
as  follows: 


Pif  (NP  |  NP,  grass,  NN)- 

pjj(NP(trees,  NNS),  punc  =  0,  coord  =  0  |  NP,  NP,  grass,  NN)- 
P(r(NP (bushes,  NNS),  punc  =  1,  coord  =  1 1  NP,  NP,  grass,  NN)- 
Ppunc ( t  (, )  I  NP,  NP,  NP,  bushes,  NNS,  grass,  NN)- 

Pcc(CC(and)  |  NP,NP,NP,  bushes,  NNS,  grass,  NN)  (9) 

The  idea  is  that,  using  the  chain  rule,  the  generation  of  two  conjuncts  and  that  which 
conjoins  them  is  estimated  as  one,  large  joint  event.20 

This  scheme  of  using  flags  to  trigger  the  Ppunc  /  Pec  parameters  is  problematic,  at 
least  from  a  theoretical  standpoint,  as  it  causes  the  model  to  be  inconsistent.  Figure 
10  shows  three  different  trees  that  would  all  receive  the  same  probability  from  Collins' 
model.  The  problem  is  that  coordinating  conjunctions  and  punctuation  are  not  gener¬ 
ated  as  first-class  words,  but  only  as  triggered  from  these  punc  and  coord  flags,  mean¬ 
ing  that  the  number  of  such  intervening  conjunctive  items  (and  the  order  in  which  they 
are  to  be  generated)  is  not  specified.  So,  for  a  given  sentence/ tree  pair  containing  a  con¬ 
junction  and /or  a  punctuation  mark,  there  is  an  infinite  number  of  similar  sentence / tree 
pairs  with  arbitrary  amounts  of  "conjunctive"  material  between  the  same  two  nodes. 
Because  all  of  these  trees  have  the  same,  non-zero  probability,  the  sum  J2tP(T)  di¬ 
verges,  where  T  is  a  possible  tree  generated  by  the  model,  meaning  the  model  is  incon¬ 
sistent  (Booth  and  Thompson,  1973).  Another  consequence  of  not  generating  post-head 
conjunctions  and  punctuation  as  first-class  words  is  that  they  do  not  count  when  calcu¬ 
lating  the  head-adjacency  component  of  Collins'  distance  metric. 

When  emulating  Collins'  model,  instead  of  reproducing  the  PpUnc  and  Pec  param¬ 
eter  classes  directly  in  our  parsing  engine,  we  chose  to  use  a  different  mechanism  that 
does  not  yield  an  inconsistent  model,  but  still  estimates  the  large  joint  event  that  was 

20In  (9),  for  clarity  we  have  left  out  subcat  generation  and  the  use  of  Collins'  distance  metric  in  the 
conditioning  contexts.  We  have  also  glossed  over  the  fact  that  lexicalized  modifying  nonterminals  are  actually 
generated  in  two  steps,  using  two  differently-smoothed  parameters. 
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the  motivation  behind  these  parameters  in  the  first  place. 

5.4.2  History  mechanism  In  our  emulation  of  Collins'  model,  we  use  the  history  to  es¬ 
timate  the  joint  event  of  generating  a  conjunction  (or  punctuation  mark)  and  its  two 
conjuncts.  The  first  big  change  is  that  we  treat  punctuation  preterminals  and  CCs  as 
first-class  objects,  meaning  that  they  are  generated  just  like  any  other  modifying  non¬ 
terminal.  The  second  change  is  a  little  more  involved. 

First,  we  redefine  the  distance  metric  to  consist  solely  of  the  vi  predicate.  Then,  we 
add  to  the  conditioning  context  a  mapped  version  of  the  previously-generated  modifier 
according  to  the  following  mapping  function: 


'+START+  if'  i  =  0, 

«i=  cc  itM‘  =  ct 

+PUNC+  if  Mi  =  ,  or  Mi  =  : , 

k+0THER+  otherwise. 


(10) 


where  M,  is  some  modifier  L,  or  R,.21  So,  our  the  maximal  context  for  our  modifying 
nonterminal  parameter  class  is  now  defined  as  follows: 

PM  (M(t)i\a(P),'y(H),wh,th,subcatside,vi(Mi),d(Mi_1),side)  (11) 

where  side  is  a  boolean-valued  event  that  indicates  whether  the  modifier  is  on  the  left 
or  right  side  of  the  head.  By  treating  CC  and  punctuation  nodes  as  first-class  nontermi¬ 
nals  and  by  adding  the  mapped  version  of  the  previously-generated  modifier,  we  have, 
in  one  fell  swoop,  incorporated  the  "no  intervening"  component  of  Collins'  distance 
metric  (the  i  =  0  case  of  the  S  function)  and  achieved  an  estimate  of  the  joint  event  of  a 
conjunction  and  its  conjuncts,  albeit  with  different  dependencies,  i.e.,  a  different  appli¬ 
cation  of  the  chain  rule.  To  put  this  parameterization  change  in  sharp  relief,  consider 
the  abstract  tree  structure 


P 


.  H  CC  Ri 

To  a  first  approximation,  under  the  old  parameterization,  the  conjunction  of  some  node 
Ri  with  a  head  II  and  a  parent  P  looked  like 

Ph(H\P)  -Pr(Ri,  coord  =  l|P,iT)  ■  pCc(CC\  P,  H,  R±), 

whereas  under  the  new  parameterization,  it  looks  like 

Ph(H  |  P)  •  pR( CC  |  P,  H ,  +START+)  •  pR(R1  |  P,  H ,  CC). 

Either  way,  the  probability  of  the  joint  conditional  event  { H ,  CC,  Pi  |  P}  is  being  esti¬ 
mated,  but  with  the  new  method,  there  is  no  need  to  add  two,  new  specialized  param- 

21  Originally,  we  also  had  an  additional  mechanism  that  attempted  to  generate  punctuation  and  conjunc¬ 
tions  with  conditional  independence.  One  of  our  reviewers  astutely  pointed  out  that  the  mechanism  led  to 
a  deficient  model  (the  very  thing  we  have  been  trying  to  avoid),  and  so  we  have  subsequently  removed  it 
from  our  model.  The  removal  leads  to  a  0.05%  absolute  reduction  in  F-measure  (which  in  this  case  is  also  a 
0.05%  relative  increase  in  error)  on  sentences  of  length  <  40  words  in  §00  of  the  Penn  Treebank.  As  this  dif¬ 
ference  is  not  at  all  statistically  significant  (according  to  a  randomized  stratified  shuffling  test  (Cohen,  1995)), 
all  evaluations  reported  in  this  paper  are  with  the  original  model. 
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eter  classes,  and  it  does  not  introduce  inconsistency  into  the  model.  Using  less  simplifi¬ 
cation,  the  probability  of  generating  the  five  children  of  Figure  9  is  now 

Ph( NP  |  NP,  grass,  NN)- 

PM(NP(trees,  NNS)  |  NP,  NP,  grass,  NN,  {},  false,  +START+,  right)- 
Pm(,  (, , , )  |  NP,NP,  grass,  NN,  {},  false,  +0THER+,  right)- 
PM(CC(and,  CC)  |  NP,  NP,  grass,  NN,  {},  false,  +PUNC+,  right)- 
PM(NP(bushes,  NNS)  |  NP,  NP,  grass,  NN,  {},  false,  CC,  right)  (12) 

As  shown  in  §7.1,  this  new  parameterization  yields  virtually  identical  performance  to 
that  of  the  Collins  model.22 

5.5  The  base  NP  model:  a  model  unto  itself 

As  we  have  already  seen,  there  are  several  ways  in  which  base  NPs  are  exceptional 
in  Collins'  parsing  model.  This  is  partly  because  the  flat  structure  of  base  NPs  in  the 
Penn  Treebank  suggested  the  use  of  a  completely  different  model  by  which  to  generate 
them.  Essentially,  the  model  for  generating  children  of  NPB  nodes  is  a  "bigrams  of 
nonterminals"  model.  That  is,  it  looks  a  great  deal  like  a  bigram  language  model,  except 
that  the  items  being  generated  are  not  words,  but  lexicalized  nonterminals.  Heads  of 
NPB  nodes  are  generated  using  the  normal  head-generation  parameter,  but  modifiers 
are  generated  always  conditioning  not  on  the  head,  but  on  the  previously-generated 
modifier.  That  is,  we  modify  equations  (7)  and  (8)  to  be 

PL,wB(L(t)i\P,L(w,t)i-i)  (13) 

Pn,K?B(R(t)i  I P,  R(w,  t)i- 1)  (14) 

Though  it  is  not  entirely  spelled  out  in  his  thesis,  Collins  considers  the  previously- 
generated  modifier  to  be  the  head  child,  for  all  intents  and  purposes.  Thus,  the  subcat 
and  distance  metrics  are  always  irrelevant,  since  it  is  as  though  the  current  modifier 
is  right  next  to  the  head.23  Another  consequence  of  this  is  that  NPBs  are  never  con¬ 
sidered  to  be  coordinated  phrases  (as  mentioned  in  §3.12),  and  thus  CCs  dominated 
by  NPB  are  never  generated  using  a  Pec  parameter;  instead,  they  are  generated  using 
a  normal  modifying-nonterminal  parameter.  Punctuation  dominated  by  NPB,  on  the 
other  hand,  is  still,  as  always,  generated  via  Ppunc  parameters,  but  where  crucially,  the 
modifier  is  always  conjoined  (via  the  punctuation  mark)  to  the  "pseudo-head"  that  is 
the  previously-generated  modifier.  Consequently,  when  generating  some  right  modi¬ 
fier  Ri,  the  previously-generated  modifier  on  the  right  side  of  the  head,  /?,_  i ,  is  never  a 
punctuation  preterminal,  but  always  the  previous  "real"  (i.e.,  non-punctuation)  preter¬ 
minal.24 

Base  NPs  are  also  exceptional  with  respect  to  determining  chart  item  equality,  the 
comma  pruning  rule  and  general  beam-pruning  (see  §6.2  for  details). 

5.6  Parameter  classes  for  priors  on  lexicalized  nonterminals 

Two  parameter  classes  that  only  make  their  appearance  in  Appendix  E  of  Collins'  thesis 
are  those  that  compute  priors  on  lexicalized  nonterminals.  These  priors  are  used  as  a 

22 As  described  in  (Bikel,  2002),  our  parsing  engine  allows  easy  experimentation  with  a  wide  variety 
of  different  generative  models,  including  the  ability  to  construct  history  contexts  from  arbitrary  numbers 
of  previously-generated  modifiers.  The  mapping  function  5  and  the  transition  function  r  presented  in  this 
section  are  just  two  examples  of  this  capability. 

23This  is  the  main  reason  that  the  cv  ("contains  verb")  predicate  is  always  false  for  NPBs,  as  that 
predicate  only  applies  to  material  that  intervenes  between  the  current  modifier  and  the  head. 

24  Interestingly,  unlike  the  regular  model,  punctuation  that  occurs  to  the  left  of  the  head  is  generated  when 
it  occurs  within  an  NPB.  Thus,  this  particular — albeit  small — deficiency  of  Collins'  punctuation-handling  does 
not  apply  to  the  base  NP  model. 
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crude  proxy  for  the  outside  probability  of  a  chart  item  (see  (Baker,  1979;  Lari  and  Young, 
1990)  for  full  descriptions  of  the  Inside-Outside  algorithm).  Previous  work  (Goodman, 
1997)  has  shown  that  the  inside  probability  alone  is  an  insufficient  scoring  metric  when 
comparing  chart  items  covering  the  same  span  during  decoding,  and  that  some  estimate 
of  the  outside  probability  of  a  chart  item  should  be  factored  into  the  score.  A  prior  on  the 
root  (lexicalized)  nonterminal  label  of  the  derivation  forest  represented  by  a  particular 
chart  item  is  used  for  this  purpose  in  Collins'  parser. 

The  prior  of  a  lexicalized  nonterminal  M(w,t)  is  broken  down  into  two,  separate 
estimates  using  parameters  from  two  new  classes,  Pprior „  and  PpriorNT  '■ 

Pprior(.M(W)ty)  —  Ppriorw('W>t)  Pprior  wr  (At  |  W:  t) 

where  p{M  |  w,  t )  is  smoothed  with  p(M  \  t),  and  where  estimates  using  the  parameters 
of  the  PPrior „  class  are  unsmoothed. 

5.7  Smoothing  weights 

Many  of  the  parameter  classes  in  Collins'  model — and  indeed,  in  most  statistical  pars¬ 
ing  models — define  conditional  probabilities  with  very  large  conditioning  contexts.  In 
this  case,  the  conditioning  contexts  represent  some  subset  of  the  history  of  the  genera¬ 
tive  process.  Even  if  there  were  orders  of  magnitude  more  training  data  available,  the 
large  size  of  these  contexts  would  cause  horrendous  sparse  data  problems.  The  solution 
is  to  smooth  these  distributions  that  are  made  rough  primarily  by  the  abundance  of  ze¬ 
ros.  Collins  uses  the  technique  of  deleted  interpolation,  which  smoothes  the  distributions 
based  on  full  contexts  with  coarser  models  that  use  less  of  the  context,  by  successively 
deleting  elements  from  the  context  at  each  back-off  level.  As  a  simple  example,  the 
head  parameter  class  smoothes  Ph0  {Pd  \  P,  w^,  t h)  with  Ph j  {H  \  P,  t h)  and  Ph2  {H  \P). 
For  some  conditional  probability  p{A  \  D),  let  us  call  the  reduced  context  at  the  ith  back¬ 
off  level  4>i{B ),  where  typically  <po ( B )  =  B.  Each  estimate  in  the  back-off  chain  is 
computed  via  maximum  likelihood  estimation,  and  the  overall  smoothed  estimate  is 
computed  using  n  —  1  smoothing  weights  for  n  back-off  levels,  denoted  Ao , . . . ,  Xn-2  ■ 
These  weights  are  used  in  a  recursive  fashion:  the  smoothed  version  e*  =  p,  [A  |  (p,(B)) 
of  an  unsmoothed  ML  estimate  e,  =  Pi(A  \<Pi(B))  at  back-off  level  i  is  computed  via  the 
formula 

Si  =  A iet  +  (1  -  Ai)ei+i,  0  <  i  <  n  -  1,  e„_i  =  e„_i.  (15) 

So,  for  example,  with  three  levels  of  back-off,  the  overall  smoothed  estimate  would  be 
defined  as 

So  =  Aoeo  +  (1  —  Ao)  [Aiei  +  (1  —  Ai)e2]  (16) 

It  is  easy  to  prove  by  structural  induction  that  if 

0  <  A*  <  1  and  ^p*(A  |  cf>i{B ))  =  1,  0  <  i  <  n  —  1, 

A 

then 

J2po{A\MB))  =  1.  (17) 

A 

Each  smoothing  weight  can  be  conceptualized  as  the  confidence  in  the  estimate 
with  which  it  is  being  multiplied.  These  confidence  values  can  be  derived  in  a  number 
of  sensible  ways;  the  technique  used  by  Collins  was  adapted  from  that  used  in  (Bikel 
et  al.,  1997),  which  makes  use  of  a  quantity  called  the  diversity  of  the  history  context 
(Witten  and  Bell,  1991),  which  is  equal  to  the  number  of  unique  futures  observed  in 
training  for  that  history  context. 
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5.7.1  Deficient  model  As  mentioned,  n  back-off  levels  require  n  —  1  smoothing  weights. 
Collins'  parser  effectively  uses  n  weights,  because  the  estimator  always  adds  an  ex¬ 
tra,  constant-valued  estimate  to  the  back-off  chain.  Collins'  parser  hard-codes  this  ex¬ 
tra  value  to  be  a  vanishingly-small  (but  non-zero)  "probability"  of  10-19,  resulting  in 
smoothed  estimates  of  the  form 

Co  =  Aoeo  +  (1  —  Ao)  [Aiei  +  (1  —  Ai)  [A2e2  +  (1  —  A2)  •  10  19]]  (18) 

when  there  are  three  levels  of  back-off.  The  addition  of  this  constant-valued  en  =  10-19 
causes  all  estimates  in  the  parser  to  be  deficient,  as  it  ends  up  throwing  away  probability 
mass.  More  formally,  the  proof  leading  to  Equation  (17)  no  longer  holds:  the  "distribu¬ 
tion"  sums  to  less  than  one  (there  is  no  history  context  in  the  model  for  which  there  are 
1019  possible  outcomes).25 

5.7.2  Smoothing  factors  and  smoothing  terms  The  formula  given  in  Collins'  thesis  for 
computing  smoothing  weights  is 


Ci  +  5  m 

where  c,  is  the  count  of  the  history  context  (f  (B)  and  iq  is  the  diversity  of  that  context.26 
The  multiplicative  constant  5  is  used  to  give  less  weight  to  the  back-off  levels  with  more 
context,  and  was  optimized  by  looking  at  overall  parsing  performance  on  the  develop¬ 
ment  test  set.  Section  00  of  the  Penn  Treebank.  We  call  this  constant  the  smoothing  factor, 
and  denote  it  as  f  j .  As  it  happens,  the  actual  formula  for  computing  smoothing  weights 
in  Collins'  implementation  is 


{Ci 

ci+h+ffUi 

0 


if  Ci  >  0, 
otherwise. 


(19) 


where  ft  is  an  unmentioned  smoothing  term.  For  every  parameter  class  except  the  subcat 
parameter  class  and  Ppriorw,  ft  =  0  and  //  =  5.0.  For  the  subcat  parameter  class, 
ft  =  5.0  and  ff  =  0.  For  Pprior„,  ft  =  1-0  and  ff  =  0.0.  This  curiously  means  that 
diversity  is  not  used  at  all  when  smoothing  subcat-generation  probabilities.27 

The  second  case  in  (19)  handles  the  situation  where  the  history  context  was  never 
observed  in  training,  i.e.,  where  c,  =  u,  =  0,  which  would  yield  an  undefined  value 

25  Collins  used  this  technique  to  ensure  that  even  futures  that  were  never  seen  with  an  observed  his¬ 
tory  context  would  still  have  some  probability  mass,  albeit  a  vanishingly  small  one  (Collins,  p.c.).  Another 
commonly-used  technique  would  be  to  back-off  to  the  uniform  distribution,  which  has  the  desirable  property 
of  not  producing  deficient  estimates.  As  with  all  of  the  Treebank-  or  model-specific  aspects  of  the  Collins 
parser,  our  engine  uses  Equation  (16)  or  (18)  depending  on  the  value  of  a  run-time  setting. 

26The  smoothing  weights  can  be  viewed  as  confidence  values  for  the  probability  estimates  with  which 
they  are  multiplied.  The  Witten-Bell  technique  crucially  makes  use  of  the  quantity  nl  =  — ,  the  average 
number  of  transitions  from  the  history  context  to  a  possible  future.  With  a  little  algebraic  manipulation, 

we  have 


a  quantity  that  is  at  its  maximum  when  rg  =  C{  and  at  its  minimum  when  rg  =  1,  that  is,  when  every  future 
observed  in  training  was  unique.  This  latter  case  represents  when  the  model  is  most  "uncertain",  in  that  the 
transition  distribution  from  4>i(B)  is  uniform  and  poorly-trained  (1  observation  per  possible  transition).  Be¬ 
cause  these  smoothing  weights  measure,  in  some  sense,  the  closeness  of  the  observed  distribution  to  uniform, 
they  can  be  viewed  as  proxies  for  the  entropy  of  the  distribution  p(  -  \ 

27 As  mentioned  above,  the  Ppriorw  parameters  are  unsmoothed.  However,  due  to  the  deficient  esti¬ 
mation  method,  they  still  have  an  associated  lambda  value,  the  computation  of  which,  just  like  the  subcat- 
generation  probability  estimates,  does  not  make  use  of  diversity. 
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Back-off 

Pl„{w, Li  |  ...) 

level 

Pr„  ( wRi  |  . . .) 

0 

7 (Li) ,  tLi ,  coord,  punc,  a(P) ,  y(iT) ,  wh ,  th ,  AL,  subcat 

1 

coord,  punc,  a(P),'y(H),th,  AL,  subcat 

2 

lLi 

Figure  11 

Back-off  levels  for  Pl„  / Prw  ,  the  modifier  head-word  generation  parameter  classes.  wl{  and  t 
are  respectively  the  head  word  and  its  part  of  speech  of  the  nonterminal  Lt .  This  table  is 
basically  a  reproduction  of  the  last  column  of  Table  7.1  in  Collins'  thesis. 


Back-off  level 

Pm„  {wMi  |  •  •  •) 

0 

)(Mi),  tMi ,  Oi(P),  7 (H),wh,th,  subcat side,  vi(M;),  <5(M;_i),  side 

1 

tMi ,  a(P),  7 (H),th,  subcat8ide,  vi (Mi),  <5(M;_i),  side 

2 

tMi 

Figure  12 

Our  new  parameter  class  for  the  generation  of  head  words  of  modifying  nonterminals. 


when  ft  =  0.  In  such  situations,  by  making  A,-  =  0,  all  remaining  probability  mass 
gets  thrown  to  the  smoothed  back-off  estimate,  e,+i.  This  is  a  crucial  part  of  the  way 
smoothing  is  done:  if  a  particular  history  context  (f>i(B )  has  never  been  observed  in 
training,  the  smoothed  estimate  using  less  context,  \  (B),  is  simply  substituted  as 
the  "best  guess"  for  the  estimate  using  more  context;  that  is,  e,  =  ej+i  .28 

5.8  Modifier  head-word  generation 

As  mentioned  above  in  §5.3,  fully-lexicalized  modifying  nonterminals  are  generated  in 
two  steps.  First,  the  label  and  part-of-speech  tag  are  generated  with  an  instance  of  Pl  or 
PR.  Next,  the  head  word  is  generated  via  an  instance  of  one  of  two  parameter  classes, 
P/,ic  or  P/fu. .  The  back-off  contexts  for  the  smoothed  estimates  of  these  parameters  is 
specified  in  Figure  11.  Notice  how  the  last  level  of  back-off  is  markedly  different  from 
the  previous  two  levels,  in  that  it  removes  nearly  all  the  elements  of  the  history:  in 
the  face  of  sparse  data,  the  probability  of  generating  the  head  word  of  a  modifying 
nonterminal  is  conditioned  only  on  its  part  of  speech. 

5.8.1  Smoothing  and  the  last  level  of  back-off  The  table  of  Figure  11  is  misleading, 
however.  In  order  to  capture  the  most  data  for  the  crucial  last-level  of  back-off,  Collins 
uses  words  that  occur  on  either  side  of  the  head  word,  resulting  in  a  general  estimate 
p(w  1 t),  as  opposed  to  j>lic  (u'L;  |  fij.  Accordingly,  in  our  emulation  of  Collins  model, 
we  replace  the  left-  and  right-word  parameter  classes  with  a  single  modifier  head-word 
generation  parameter  class  that,  as  with  (11),  includes  a  boolean  side  component  that  is 
deleted  from  the  last  level  of  back-off.  See  Figure  12. 

Even  with  this  change,  there  is  still  a  problem.  Every  head  word  in  a  lexicalized 
parse  tree  is  the  modifier  of  some  other  head  word — except  the  word  that  is  the  head 
of  the  entire  sentence  (i.e.,  the  head  word  of  the  root  nonterminal).  In  order  to  prop¬ 
erly  duplicate  Collins'  model,  an  implementation  must  take  care  that  the  P(w  \  t)  model 
includes  counts  for  these  important  head  words.29 

28  This  fact  will  be  crucial  toward  understanding  how  little  the  Collins  parsing  model  relies  on  bilexical 
statistics,  as  described  in  §7.2  and  the  supporting  experiment  shown  in  Figure  17. 

29In  our  implementation,  we  add  such  counts  by  having  our  trainer  generate  a  "fake"  modifier  event 
where  the  observed,  lexicalized  root  nonterminal  is  considered  a  modifier  of  +TOP+,  the  hidden  nonterminal 
that  is  the  parent  of  the  observed  root  of  every  tree  (see  §5.9  for  details  on  the  +TOP+  nonterminal). 
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S(sat-VBD) 


NP-A(Fido-NNP)  ADVP(faithfully-RB)  VP(sat-VBD) 

NPB  (Fido-NNP)  RB  VBD 


faithfully 


sat 


Faithful  Fido 
Figure  13 

The  low-frequency  word  "Fido"  is  mapped  to  +UNKNOWN+,  but  only  when  it  is  generated,  not 
when  it  is  conditioned  upon.  All  the  nonterminals  have  been  lexicalized  (except  for 
preterminals)  to  show  where  the  heads  are. 


5.8.2  Unknown  word  mapping  As  mentioned  above,  instead  of  mapping  every  low- 
frequency  word  in  the  training  data  to  some  special  +UNKNOWN+  token,  Collins'  trainer 
instead  leaves  the  training  data  untouched,  and  instead  selectively  maps  words  that  ap¬ 
pear  in  the  back-off  levels  of  the  parameters  from  the  Pl„  and  Pr^  parameter  classes. 
Rather  curiously,  the  trainer  only  maps  words  that  appear  in  the  futures  of  these  param¬ 
eters,  but  never  in  the  histories.  Put  another  way,  low-frequency  words  are  generated  as 
+UNKNOWN+,  but  are  left  unchanged  when  they  are  conditioned  upon.  For  example,  in 
Figure  13,  where  we  assume  "Fido"  is  a  low-frequency  word,  the  trainer  would  derive 
counts  for  the  smoothed  parameter 

Pl„  (+UNKN0WN  +  I  NP-A,  NNP,  coord  =  0,  pane  =  0,  S,  VP,  sat,  VBD, . . .)  (20) 

However,  when  collecting  events  that  condition  on  "Fido",  the  word  would  not  be 
mapped,  such  as  the  parameters 

pL  (J  J(J  J)  |  NPB,  NNP,  Fido) 

Plw  (Faithful  |  JJ,  JJ,  NPB,  NNP,  Fido) 

This  strange  mapping  scheme  has  some  interesting  consequences.  First,  imagine 
what  happens  to  words  that  are  truly  unknown,  that  never  occurred  in  the  training  data. 
Such  words  are  mapped  to  the  +UNKNOWN+  token  outright  before  parsing.  Whenever 
the  parser  estimates  a  probability  with  this  truly  unknown  word  in  the  history,  it  will 
necessarily  throw  all  probability  mass  to  the  backed-off  estimate  (e\,  in  our  earlier  nota¬ 
tion),  since  +UNKNOWN+  effectively  never  occurred  in  a  history  context  during  training. 

The  second  consequence  is  that  the  mapping  scheme  yields  a  "superficient"30  model, 
if  all  other  parts  of  the  model  are  probabilistically  sound  (which  is  actually  not  the  case 
here).  With  a  parsing  model  such  as  Collins'  that  uses  bilexical  dependencies,  generat¬ 
ing  words  in  the  course  of  parsing  is  very  much  like  a  bigram  language  model:  every 
word  is  generated  conditioning  on  some  previously  generated  word,  as  well  as  some 
hidden  material.  The  only  difference  is  that  the  word  being  conditioned  upon  is  often 
not  the  immediately-preceding  word  in  the  sentence.  However,  one  could  plausibly 

30The  term  "deficient"  is  used  to  denote  a  model  in  which  one  or  more  estimated  distributions  sums  to 
less  than  1.  We  use  the  term  "superficient"  to  denote  a  model  in  which  one  or  more  estimated  distributions 
sums  to  greater  than  1 . 
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Back-off  level 

PTopNT(H(t)  |  ...) 

PtoPv,  {w  |  •  •  •) 

0 

+  TOP  + 

t,H,  +T0P+ 

1 

n/a 

t 

Figure  14 

Back-off  structure  for  Ptopnt  and  Ptop„,  which  estimate  the  probability  of  generating  H(w,  t ) 
as  the  root  nonterminal  of  a  parse  tree.  Pro pnt  is  unsmoothed. 


construct  a  consistent  bigram  language  model  that  generates  words  with  the  same  de¬ 
pendencies  as  are  in  a  statistical  parser  that  uses  bilexical  dependencies  derived  from 
head-lexicalization. 

Collins  (p.c.)  notes  that  his  parser's  unknown-word-mapping  scheme  can  be  made 
consistent  if  one  were  to  add  a  parameter  class  that  estimates  p(w  |  +UNKN0WN+),  where 
w  €  Vl  U  {+UNKN0WN+}.  The  values  of  these  estimates  for  a  given  sentence  would 
be  constant  across  all  parses,  meaning  that  the  "superficiency"  of  the  model  would  be 
irrelevant  when  determining  arg  maxP(T  |  5). 

T 


5.9  The  TOP  parameter  classes 

It  is  assumed  that  all  trees  that  can  be  generated  by  the  model  have  an  implicit  non¬ 
terminal  +TOP+  that  is  the  parent  of  the  observed  root.  The  observed,  lexicalized  root 
nonterminal  is  generated  conditioning  on  +TOP+  (which  has  a  prior  probability  of  1.0) 
using  a  parameter  from  the  class  Ptop  ■  This  special  parameter  class  is  mentioned  in  a 
footnote  in  Chapter  7  of  Collins'  thesis.  There  are  actually  two  parameter  classes  used 
to  generated  observed  roots,  one  for  generating  the  partially-lexicalized  root  nontermi¬ 
nal,  the  other  for  generating  the  head  word  of  the  entire  sentence,  which  we  will  call 
Ptopnt  and  Pro pw  /  respectively.  Figure  14  gives  the  unpublished  back-off  structure  of 
these  two  additional  parameter  classes. 

Note  that  Ptop„  backs  off  to  simply  estimating  p(w  1 1).  Technically,  it  should  be 
estimating  Pnt(w  \  t),  which  is  to  say  the  probability  of  a  word  occurring  with  a  tag  in 
the  space  of  lexicalized  nonterminals.  This  is  different  from  the  last  level  of  back-off  in 
the  modifier  head-word  parameter  classes,  which  is  effectively  estimating  p(w  \  t)  in  the 
space  of  lexicalized  preterminals.  The  difference  is  that  in  the  same  sentence,  the  same 
head  word  can  occur  with  the  same  tag  in  multiple  nodes,  such  as  "sat"  occurring  with 
the  tag  VBD  three  times  (instead  of  just  once)  in  the  tree  shown  in  Figure  8.  Despite 
this  difference,  Collins'  parser  uses  counts  from  the  (shared)  last  level  of  back-off  of 
the  Pl„  / Pr„  parameters  when  delivering  e\  estimates  for  the  Ptop„  parameters.  Our 
parsing  engine  emulates  this  "count  sharing"  for  Ptop„  by  default,  by  sharing  counts 
from  our  Pm„  parameter  class. 

6.  Decoding 

Parsing,  or  decoding,  is  performed  via  a  probabilistic  version  of  the  CKY  chart-parsing 
algorithm.  As  with  normal  CKY,  even  though  the  model  is  defined  in  a  top-down, 
generative  manner,  decoding  proceeds  bottom-up.  Collins'  thesis  gives  a  pseudocode 
version  of  his  algorithm  in  an  appendix.  This  section  contains  a  few  practical  details. 

6.1  Chart  item  equality 

Since  the  goal  of  the  decoding  process  is  to  determine  the  maximally-likely  theory,  if 
during  decoding  a  proposed  chart  item  is  equal  (or,  technically,  equivalent)  to  an  item 
that  is  already  in  the  chart,  the  one  with  the  greater  score  survives.  Chart  item  equality 
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is  closely  tied  to  the  generative  parameters  used  to  construct  theories:  we  want  to  treat 
two  chart  items  as  unequal  if  they  represent  derivation  forests  that  would  be  considered 
unequal  according  to  the  output  elements  and  conditioning  contexts  of  the  parameters 
used  to  generate  them,  subject  to  the  independence  assumptions  of  the  model.  For 
example,  for  two  chart  items  to  be  considered  equal,  they  must  have  the  same  label  (the 
label  of  the  root  of  their  respective  derivation  forests'  subtrees),  the  same  head  word  and 
tag  and  the  same  left  and  right  subcat.  The  must  also  have  the  same  head  label  (that  is, 
label  of  the  head  child). 

If  a  chart  item's  root  label  is  an  NP  node,  its  head  label  is  most  often  an  NPB  node, 
given  the  "extra"  NP  levels  that  are  added  during  preprocessing  to  ensure  that  NPB 
nodes  are  always  dominated  by  NP  nodes.  In  such  cases,  the  chart  item  will  contain  a 
back-pointer  to  the  chart  item  that  represents  the  base  NP.  Curiously,  however,  Collins' 
implementation  considers  the  head  label  of  the  NP  chart  item  not  to  be  NPB,  but  rather 
the  head  label  of  the  NPB  chart  item.  In  other  words,  to  get  the  head  label  of  an  NP  chart 
item,  one  must  "peek  through"  the  NPB  and  get  at  the  NPB's  head  label.  Presumably, 
this  was  done  as  a  consideration  for  the  NPB  nodes  being  "extra"  nodes,  in  some  sense. 
It  appears  to  have  little  effect  on  overall  parsing  accuracy,  however. 

6.2  Pruning 

Ideally,  every  parse  theory  could  be  kept  in  the  chart,  and  when  the  root  symbol  has 
been  generated  for  all  theories,  the  top-ranked  one  would  "win".  In  order  to  speed 
things  up,  Collins  employs  three  different  types  of  pruning.  The  first  form  of  pruning 
is  to  use  a  beam:  the  chart  memoizes  the  highest-scoring  theory  in  each  span,  and  if  a 
proposed  chart  item  for  that  span  is  not  within  a  certain  factor  of  the  top-scoring  item,  it 
is  not  added  to  the  chart.  Collins  reports  in  his  thesis  that  he  uses  a  beam  width  of  105. 
As  it  happens,  the  beam  width  for  his  thesis  experiments  was  104.  Interestingly,  there 
is  a  negligible  difference  in  overall  parsing  accuracy  when  this  wider  beam  is  used  (see 
§7.1,  Figure  16).  An  interesting  modification  to  the  standard  beam  in  Collins'  parser 
is  that  for  chart  items  representing  NP  or  NP-A  derivations  with  more  than  one  child, 
the  beam  is  expanded  to  be  104  •  e3.  We  suspect  that  Collins  made  this  modification  to 
handle  the  greater  perplexity  associated  with  NPs  after  he  added  the  base  NP  model. 

The  second  form  of  pruning  employed  is  using  a  comma  constraint.  Collins  ob¬ 
served  that  in  the  Penn  Treebank  data,  96%  of  the  time,  when  a  constituent  contained  a 
comma,  the  word  immediately  following  the  end  of  the  constituent's  span  was  either  a 
comma  or  the  end  of  the  sentence.  So,  for  speed  reasons,  the  decoder  rejects  all  theories 
that  would  generate  constituents  that  violate  this  comma  constraint.31  There  is  a  sub¬ 
tlety  to  Collins'  implementation  of  this  form  of  pruning,  however.  Commas  are  quite 
common  within  parenthetical  phrases.  Accordingly,  if  a  comma  in  an  input  sentence 
occurs  after  an  open  parenthesis  and  before  a  closing  parenthesis  or  the  end  of  the  sen¬ 
tence,  it  is  not  considered  a  comma  for  the  purposes  of  the  comma  constraint.  Another 
subtlety  is  that  the  comma  constraint  should  effectively  not  be  employed  when  pursu¬ 
ing  theories  of  an  NPB  subtree.  As  it  turns  out,  using  the  comma  constraint  also  affects 
accuracy,  as  shown  in  §7.1. 

The  final  form  of  pruning  employed  is  rather  subtle:  within  each  cell  of  the  chart 
that  contains  items  covering  some  span  of  the  sentence,  Collins'  parser  uses  buckets 
of  items  that  all  share  the  same  root  nonterminal  label  for  their  respective  derivations. 
Only  100  of  the  top-scoring  items  covering  the  same  span  with  the  same  nonterminal 

31  If  one  generates  commas  as  first-class  words,  as  we  have  done,  one  must  take  great  care  in  applying  this 
comma  constraint,  for  otherwise,  chart  items  that  represent  partially-completed  constituents  (i.e.,  constituents 
for  which  not  all  modifiers  have  been  generated)  may  be  incorrectly  rejected.  This  is  especially  important  for 
NPB  constituents. 
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Model 

Performance  on  Section  00 

LR 

LP 

CBs 

0  CBs 

<  2  CBs 

F 

Collins'  Model  2 

89.75 

90.19 

0.77 

69.10 

88.31 

89.97 

Baseline  (Model  2  emulation) 

89.89 

90.14 

0.78 

68.82 

89.21 

90.01 

Clean-room  Model  2 

88.85 

88.97 

0.92 

65.55 

87.06 

88.91 

Figure  15 

Overall  parsing  results  using  only  details  found  in  (Collins,  1997;  Collins,  1999).  The  first  two 
lines  show  the  results  of  Collins'  parser  and  those  of  our  parser  in  its  "complete"  emulation 
mode  (i.e.,  including  unpublished  details).  All  reported  scores  are  for  sentences  of  length  <  40 
words.  LR/LP  are  the  primary  scoring  metrics,  labeled  precision  and  labeled  recall,  respectively. 
CBs  is  the  number  of  crossing  brackets.  0  CBs  and  <  2  CBs  are  the  percentage  of  sentences  with  0 
and  <  2  crossing  brackets,  respectively.  F  (the  "F-measure")  is  the  evenly-weighted  harmonic 
mean  of  precision  and  recall,  or  iPLPpPPR)  ■ 


label  are  kept  in  a  bucket,  meaning  that  if  a  new  item  is  proposed  and  there  are  already 
100  items  covering  the  same  span  with  the  same  label  in  the  chart,  then  it  will  be  com¬ 
pared  to  the  lowest-scoring  item  in  the  bucket.  If  it  has  a  higher  score,  it  will  be  added  to 
the  bucket  and  the  lowest-scoring  item  will  be  removed;  otherwise,  it  will  not  be  added. 
Apparently,  this  type  of  pruning  has  little  effect,  and  so  we  have  not  duplicated  it  in  our 
engine. 

6.3  Unknown  words  and  parts  of  speech 

When  the  parser  encounters  an  unknown  word,  the  first-best  tag  delivered  by  Ratna- 
parkhi's  tagger  (Ratnaparkhi,  1996)  is  used.  As  it  happens,  the  tag  dictionary  built  up 
when  training  contains  entries  for  every  word  observed,  even  if  it  was  a  low-frequency 
word.  This  means  that  when  decoding,  the  output  of  the  tagger  is  used  only  for  those 
words  that  are  truly  unknown,  i.e.,  that  were  never  observed  in  training.  For  all  other 
words,  the  chart  is  seeded  with  a  separate  item  for  each  tag  observed  with  that  word  in 
training. 

7.  Evaluation 

7.1  Effects  of  Unpublished  Details 

In  this  section  we  present  the  results  of  effectively  doing  a  "clean-room"  implementation 
of  Collins'  parsing  model,  that  is,  using  only  information  available  in  (Collins,  1997; 
Collins,  1999),  as  shown  in  Figure  15. 

The  clean-room  model  has  an  10.6%  increase  in  F-measure  error  when  compared 
to  Collins'  parser,  and  an  11.0%  increase  in  F-measure  error  compared  to  our  engine 
in  its  complete  emulation  of  Collins'  Model  2.  This  is  comparable  to  the  increase  in 
error  seen  when  removing  such  published  features  as  the  verb  intervening  component 
of  the  distance  metric,  where  F-measure  error  increases  by  9.86%,  or  the  subcat  feature, 
the  removal  of  which  results  in  a  7.62%  increase  in  F-measure  error.33  Therefore,  while 
the  collection  of  unpublished  details  presented  in  §§3,  4,  5  &  6  is  disparate,  in  toto  they 

32  Although  we  have  implemented  a  version  of  this  type  of  pruning  that  limits  the  number  of  items  that 
can  be  collected  in  any  one  cell,  that  is,  the  maximum  number  of  items  that  cover  a  particular  span. 

33These  F-measures  and  the  differences  between  them  were  calculated  from  experiments  presented  in 
(Collins,  1999,  page  201),  which,  unlike  our  reported  numbers,  were  on  all  sentences,  not  just  those  of  length 
<  40  words.  As  Collins  notes,  removing  both  the  distance  metric  and  subcat  features  results  in  a  gigantic 
drop  in  performance,  since  without  both  of  these  features,  the  model  has  no  way  to  encode  the  fact  that  flatter 
structures  should  be  avoided  in  several  crucial  cases,  such  as  for  PPs,  which  tend  to  prefer  one  argument  to 
the  right  of  their  head  children. 
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Model  description 

LR 

Peri 

LP 

:orman 

CBs 

ce  on  Sec 
0  CBs 

tion  00 
<  2  CBs 

F 

Collins'  Model  2 

89.75 

90.19 

0.77 

69.10 

88.31 

89.97 

Baseline  (Model  2  emulation) 

89.89 

90.14 

0.78 

68.82 

89.21 

90.01 

Unknown  word  threshold  =  5  and  unknown 
words  handled  in  "obvious"  way  (see  §4.3) 

89.94 

90.22 

0.77 

68.99 

89.27 

90.08 

No  training  trees  skipped  (see  §4.2) 

89.85 

90.12 

0.78 

68.71 

89.16 

89.98 

Beam  width  =  10af 

89.90 

90.14 

0.78 

68.93 

89.16 

90.02 

Non-deficient  estimation  (see  §5.7.1) 

89.75 

90.00 

0.80 

68.82 

88.88 

89.87 

No  comma  constraint  (see  §6.2) 

89.52 

89.80 

0.84 

68.09 

88.20 

89.66 

No  universal  p(w  \  t, )  model} 

89.40 

89.17 

0.88 

66.14 

87.92 

89.28 

Clean-room  Model  2 

88.85 

88.97 

0.92 

65.55 

87.06 

88.91 

Figure  16 


Effects  of  independently  removing/ changing  individual  details  on  overall  parsing  performance. 
All  reported  scores  are  for  sentences  of  length  <  40  words.  fWith  beam  width  =  105,  processing 
time  was  3.36  times  longer  than  with  standard  beam  (104).  }No  count  sharing  was  performed  for 
Pro p„  (see  §5.9),  and  p(w  \  t.)  estimates  were  side-specific  (see  §5.8.1). 


are  every  bit  as  important  to  overall  parsing  performance  as  certain  of  the  published 
features. 

This  does  not  mean  that  the  all  details  are  equally  important.  Figure  16  shows  the 
effect  of  independently  removing  or  changing  certain  of  the  more  than  30  unpublished 
details  on  overall  parsing  performance.34  Often,  the  detrimental  effect  is  quite  insignif¬ 
icant,  even  in  the  performance-obsessed  world  of  statistical  parsing,  and  occasionally, 
the  effect  is  not  even  detrimental  at  all.  That  is  why  we  do  not  claim  importance  of  any 
single  unpublished  detail,  but  rather  that  of  their  totality,  given  that  several  of  the  un¬ 
published  details  are,  most  likely,  interacting.  However,  we  note  that  certain  individual 
details,  such  as  the  "universal  p(w  \  t)  model",  do  appear  to  have  a  much  more  marked 
effect  on  overall  parsing  accuracy  than  others. 

7.2  Bilexical  dependencies 

The  previous  section  accounts  for  the  noticeable  effects  of  all  the  unpublished  details 
of  Collins'  model.  But  what  of  the  details  that  were  published?  In  Chapter  8  of  his 
thesis,  Collins  gives  an  account  on  the  motivation  of  various  features  of  his  model,  in¬ 
cluding  the  distance  metric,  the  model's  use  of  subcats  (and  their  interaction  with  the 
distance  metric)  and  structural  versus  semantic  preferences.  In  the  discussion  of  this 
last  issue,  Collins  points  to  the  fact  that  structural  preferences — which,  in  his  model,  are 
modeled  primarily  by  the  Pl  and  Pr  parameters — often  provide  the  right  information 
for  disambiguating  competing  analyses,  but  that  these  structural  preferences  may  be 
"overridden"  by  semantic  preferences.  Bilexical  statistics  (Eisner,  1996),  as  represented 
by  the  maximal  context  of  the  Pl„  and  Pru  parameters,  serve  as  a  proxy  for  such  se¬ 
mantic  preferences,  where  the  actual  modifier  word  (as  opposed  to,  say,  merely  its  part 
of  speech)  serves  to  indicate  the  particular  semantics  of  its  head.  Indeed,  such  bilexi¬ 
cal  statistics  were  widely  assumed  for  some  time  to  be  a  source  of  great  discriminative 
power  for  several  different  parsing  models,  including  that  of  Collins. 

However,  Gildea  (2001)  had  re-implemented  Collins'  Model  1  (essentially  Model 
2  but  without  subcats)  and  had  altered  the  Plic  and  Pu  c  parameters  so  that  they  no 

34 As  a  reviewer  pointed  out,  the  use  of  the  comma  constraint  is  a  "published"  detail.  However,  the 
specifics  of  how  certain  commas  do  not  apply  to  the  constraint  is  an  "unpublished  detail",  as  mentioned  in 
§6.2. 
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Back-off  level 

Number  of  accesses 

Percentage 

0 

3,257,309 

1.49 

1 

24,294,084 

11.0 

2 

191,527,387 

87.4 

Total 

219,078,780 

100.0 

Figure  17 

Number  of  times  our  parsing  engine  was  able  to  deliver  a  probability  for  the  various  levels  of 
back-off  of  the  mod-word  generation  model,  Pm„  ,  when  testing  on  Section  00,  having  trained  on 
Sections  02-21.  In  other  words,  this  table  reports  how  often  a  context  in  the  back-off  chain  of 
Pm„  that  was  needed  during  decoding  was  observed  in  training. 


longer  had  the  top  level  of  context  that  included  the  head  word  (he  removed  "Back¬ 
off  level  0",  as  depicted  in  Figure  11).  In  other  words,  Gildea  removed  all  bilexical 
statistics  from  the  overall  model.  Surprisingly,  this  resulted  in  only  a  0.45%  absolute 
reduction  in  F-measure  (3.3%  relative  increase  in  error).  Unfortunately,  this  result  was 
not  entirely  conclusive,  in  that  Gildea  was  only  able  to  do  a  partial  re-implementation 
of  Collins'  baseline  model,  the  performance  of  which  was  not  quite  as  good  as  that 
of  Collins'  parser.35  Training  on  Sections  02-21,  we  have  duplicated  Gildea's  bigram- 
removal  experiment,  except  that  our  chosen  test  set  is  Section  00  instead  of  Section  23 
and  our  chosen  model  is  the  more  widely-used  Model  2.  Using  the  mode  that  most 
closely  emulates  Collins'  Model  2,  with  bigrams,  our  engine  gets  a  recall  of  89.89%  and 
a  precision  of  90.14%  on  sentences  of  length  <  40  words  (see  Figure  19,  Model  Mtw,tw)- 
Without  bigrams,  performance  drops  only  to  R89.49%,  P89.95% — an  exceedingly  small 
drop  in  performance  (see  Figure  19,  Model  M.tW}t)-  In  an  additional  experiment,  we 
examined  the  number  of  times  that  the  parser  was  able  to  deliver  a  requested  probability 
for  the  modifier-word  generation  model  using  the  increasingly  less-specific  contexts  of 
the  three  back-off  levels,  while  decoding  Section  00.  The  results  are  in  Figure  17.  Back¬ 
off  level  0  indicates  the  use  of  the  full  history  context,  which  contains  the  head  child's 
head  word.  Note  that  probabilities  making  use  of  this  full  context,  that  is  making  use 
of  bilexical  dependencies,  are  used  only  1.49%  of  the  time.  Combined  with  the  results 
from  the  previous  experiment,  this  suggests  rather  convincingly  that  such  statistics  are 
far  less  significant  than  once  thought  to  the  overall  discriminative  power  of  Collins' 
models,  confirming  Gildea's  result  for  Model  2. 36 

7.3  Choice  of  heads 

If  not  bilexical  statistics,  then  surely,  one  might  think,  head-choice  is  critical  to  the  per¬ 
formance  of  a  head-driven,  lexicalized  statistical  parsing  model.  Partly  to  this  end,  in 
(Chiang  and  Bikel,  2002),  we  explored  methods  for  recovering  latent  information  in  tree- 
banks.  The  second  half  of  that  paper  focused  on  a  use  of  the  Inside-Outside  algorithm 
to  re-estimate  the  parameters  of  a  model  defined  over  an  augmented  tree  space,  where 
the  observed  data  were  considered  to  be  the  gold-standard  labeled  bracketings  found 
in  the  Treebank,  and  the  hidden  data  were  considered  to  be  the  head  lexicalizations, 
one  of  the  most  notable  tree  augmentations  performed  by  modern  statistical  parsers. 
These  EM  experiments  were  motivated  by  the  desire  to  overcome  the  limitations  im¬ 
posed  by  the  heuristics  that  have  been  heretofore  used  to  perform  head  lexicalization  in 

35  The  re-implementation  was  necessarily  only  partial,  as  Gildea  did  not  have  access  to  all  the  unpub¬ 
lished  details  of  Collins'  models  that  are  presented  in  this  paper. 

36  On  a  separate  note,  it  may  come  as  a  surprise  that  the  decoder  needs  to  access  more  than  219  million 
probabilities  during  the  course  of  parsing  the  1917  sentences  of  Section  00.  Among  other  things,  this  certainly 
points  to  the  utility  of  caching  probabilities  (the  219  million  are  tokens,  not  types). 
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Model 

LR 

LP 

CBs 

0  CBs 

<  2  CBs 

F 

Collins'  Model  2 

89.75 

90.19 

0.77 

69.10 

88.31 

89.97 

Baseline  (Model  2  emulation) 

89.89 

90.14 

0.78 

68.82 

89.21 

90.01 

Simplified  head  rules 

88.55 

88.80 

0.86 

67.25 

87.42 

88.67 

Figure  18 

Results  on  Section  00  with  simplified  head  rules.  The  baseline  model  is  our  engine  in  its  closest 
possible  emulation  of  Collins'  Model  2. 


Parameter  class 

Pm 

Pm„ 

Score 

conditioning  on 

^9! 

Wh 

Wh 

LR 

LP 

CBs 

0  CBs 

<  2  CBs 

F 

A'i-tw  ,tw 

a/t  a  1  Mt,t 

Model  KA 

Mtw,z 

Mt,0 

M.0,0 

n 

/ 

MM 

/ 

89.89 

90.14 

0.78 

68.82 

89.21 

90.01 

/ 

89.49 

89.95 

0.80 

67.98 

88.82 

89.72 

88.20 

88.89 

0.91 

65.00 

87.13 

88.54 

/ 

89.24 

89.86 

0.81 

66.80 

88.76 

89.55 

mm 

88.01 

88.96 

0.91 

63.93 

86.91 

88.48 

87.01 

88.75 

0.96 

61.08 

86.00 

87.87 

Figure  19 

Parsing  performance  with  various  models  on  Section  00  of  the  Penn  Treebank.  Pm  is  the 
parameter  class  for  generating  partially  lexicalized  modifying  nonterminals  (a  nonterminal  label 
and  part  of  speech).  is  the  parameter  class  that  generates  the  head  word  of  a  modifying 
nonterminal.  Together,  Pm  and  Pm„  generate  a  fully-lexicalized  modifying  nonterminal.  The 
check  marks  indicate  the  inclusion  of  the  head  word  wn  and  its  part  of  speech  th  of  the 
lexicalized  head  nonterminal  H(th,  Wh)  in  the  conditioning  contexts  of  Pm  and  Pm„  • 


treebanks.  In  particular,  it  appeared  that  the  head  rules  used  in  Collins'  parser  had  been 
tweaked  specifically  for  the  English  Penn  Treebank.  The  promise  of  using  EM  would 
mean  that  very  little  effort  would  need  be  spent  on  developing  head  rules,  since  EM 
could  take  an  initial  model  that  used  simple  heuristics  and  optimize  it  appropriately 
to  maximize  the  likelihood  of  the  unlexicalized  (observed)  training  trees.  To  test  this, 
we  performed  experiments  with  an  initial  model  trained  using  an  extremely  simpli¬ 
fied  head-rule  set,  where  all  rules  were  of  the  form  "if  the  parent  is  X,  then  choose  the 
left/ rightmost  child".  A  surprising  side  result  was  that,  even  with  this  simplified  set  of 
head  rules,  overall  parsing  performance  still  remained  quite  high.  Using  our  simplified 
head-rule  set  for  English,  our  engine  in  its  "Model  2-emulation  mode"  achieved  a  re¬ 
call  of  88.55%  and  a  precision  of  88.80%,  for  sentences  of  length  <40  words  in  Section 
00  (see  Figure  18).  So,  contrary  to  our  expectations,  the  lack  of  careful  head  choice  is 
not  crippling  in  allowing  the  parser  to  disambiguate  competing  theories,  and  a  further 
indication  that  semantic  preferences,  as  represented  by  conditioning  on  a  head  word, 
rarely  override  structural  ones. 

7.4  Lexical  dependencies  matter 

Given  that  bilexical  dependencies  are  almost  never  used  and  barely  affect  overall  pars¬ 
ing  performance,  and  given  that  the  choice  of  head  is  not  terribly  critical  either,  one 
might  wonder  what  power,  if  any,  head  lexicalization  is  providing.  The  answer  is  that, 
even  when  one  removes  bilexical  dependencies  from  the  model,  there  are  still  plenty 
of  lexico-structural  dependencies,  i.e.,  structures  being  generated  conditioning  on  head 
words,  and  head  words  being  generated  conditioning  on  structures. 

To  test  the  effect  of  such  lexico-structural  dependencies  in  our  lexicalized  PCFG- 
style  formalism,  we  experimented  with  the  removal  of  the  head  tag  t/ ,  and/or  the  head 
word  Wh  from  the  conditioning  contexts  of  the  Pmw  and  Pm  parameters.  The  results 
are  listed  in  Figure  19.  Model  Mtw,tw  shows  our  baseline,  and  Model  M 0,0  shows  the 
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effect  of  removing  all  dependence  on  the  head  word  and  its  part  of  speech,  with  the 
other  models  illustrating  varying  degrees  of  removing  elements  from  the  two  parame¬ 
ter  classes'  conditioning  contexts.  Notably,  including/ removing  the  head  word  Wh  from 
the  Pm  contexts  appears  to  have  a  significant  effect  on  overall  performance,  as  shown 
by  moving  from  Model  Mtw,t  to  Model  .M/j  and  from  Model  Mtw,0  to  Model 
This  reinforces  the  notion  that  particular  head  words  have  structural  preferences,  so  that 
making  the  Pm  parameters  dependent  on  head  words  would  capture  such  preferences. 
As  for  effects  involving  dependence  on  the  head  tag  th,  observe  that  when  moving  from 
Model  Mtw,t  to  Model  Mtw,0,  there  is  a  small  drop  in  both  recall  and  precision,  whereas 
when  making  an  analogous  move  from  Model  Mt,t  to  Model  Mt,0,  there  is  a  drop  in 
recall,  but  a  slight  gain  in  precision  (the  two  moves  are  analogous,  in  that  in  both  cases, 
th  is  dropped  from  the  context  of  Pm„  )•  It  is  not  evident  why  these  two  moves  do  not 
produce  similar  performance  losses,  but  in  both  cases,  the  performance  drops  are  small 
relative  to  those  observed  when  eliminating  Wh  from  the  conditioning  contexts,  indi¬ 
cating  that  head  words  matter  far  more  than  parts  of  speech  for  determining  structural 
preferences,  as  one  would  expect. 

8.  Conclusion 

We  have  documented  what  we  believe  is  the  complete  set  of  heretofore  unpublished 
details  Collins  used  in  his  parser,  such  that,  along  with  Collins'  thesis  (Collins,  1999), 
this  paper  contains  all  information  necessary  to  duplicate  Collins'  benchmark  results. 
Indeed,  these  as-yet-unpublished  details  account  for  an  11%  relative  increase  in  er¬ 
ror  from  an  implementation  including  all  details  to  a  clean-room  implementation  of 
Collins'  model.  We  have  also  shown  a  cleaner  and  equally-well-performing  method 
for  the  handling  of  punctuation  and  conjunction,  and  we  have  revealed  certain  other 
probabilistic  oddities  about  Collins'  parser.  We  have  analyzed  not  only  the  effect  of  the 
unpublished  details,  but  also  re-analyzed  the  effect  of  certain  well-known  details,  re¬ 
vealing  that  bilexical  dependencies  are  barely  used  by  the  model  and  that  head  choice 
is  not  nearly  as  important  to  overall  parsing  performance  as  once  thought.  Finally,  we 
have  performed  experiments  that  show  that  the  true  discriminative  power  of  lexical- 
ization  appears  to  lie  in  the  fact  that  unlexicalized  syntactic  structures  are  generated 
conditioning  on  the  head  word  and  head  tag.  These  results  on  the  lack  of  reliance  on 
bilexical  statistics  suggest  that  generative  models  still  have  room  for  improvement  by 
employing  bilexical-class  statistics,  that  is,  dependencies  among  head-modifier  word 
classes,  where  such  classes  may  be  defined  by,  say,  WordNet  synsets.  Such  dependen¬ 
cies  might  finally  be  able  to  capture  the  semantic  preferences  that  were  thought  to  be 
captured  by  standard  bilexical  statistics,  as  well  as  alleviate  the  sparse  data  problems 
associated  with  standard  bilexical  statistics.  This  is  the  subject  of  our  current  research. 

Appendix:  Complete  List  of  Parameter  Classes 

This  section  contains  tables  for  all  parameter  classes  in  Collins'  Model  3,  with  appro¬ 
priate  modifications  and  additions  from  the  tables  presented  in  Collins'  thesis.  The 
notation  is  that  used  throughout  this  paper.  In  particular,  for  notational  brevity  we 
use  M(w,t)i  to  refer  to  the  three  items  Mu  and  wm{  that  constitute  some  fully- 
lexicalized  modifying  nonterminal,  and  similarly  M  ( t)  *  to  refer  to  the  two  items  M,  and 
tMi  that  constitute  some  partially-lexicalized  modifying  nonterminal.  The  (unlexical¬ 
ized)  nonterminal-mapping  functions  a  and  7  are  defined  at  the  beginning  of  §5.  As  a 
shorthand,  7 (M(f)j)  =  7 (Mi),  tMi . 

The  head-generation  parameter  class,  Ph,  gap-generation  parameter  class.  Pa  and 
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subcat-generation  parameter  classes,  PsubcatL  and  PsubcatR  ,  have  back-off  structures  as 
follows. 


Back-off  level 

Ph(H  |...) 

Pg(G  |...) 

^subcatL  (  SUbCQitL  |  .  .  .) 

P subcat r  (  Subctttji  |  •  •  •) 

0 

7 (P),  wh,  th 

a(7(P)),  a(7(-P)),  wh,  th 

1 

7 CP),  th 

a(l (P)),  th 

2 

7  (P) 

« (7 (P)),  « (7 (H)) 

The  two  parameter  classes  for  generating  modifying  nonterminals  that  are  not  dom¬ 
inated  by  a  base  NP,  Pm  and  Pm„  ,  have  the  following  back-off  structures.  Recall  that 
back-off  level  3  of  the  Pm„  parameters  includes  words  that  are  the  heads  of  the  observed 
roots  of  sentences  (that  is,  the  head  word  of  the  entire  sentence). 


Back-off  level 

Pm  ( M(t)i ,  coord,  punc  | . . .) 

0 

a(P),  7 (H),  wh,  th ,  Aside,  subcat side,  side 

1 

a(P),  7 (H),  th,  Aside ,  subcatside,  side 

2 

a(P),  7 (H),  Aside ,  subcatside,  side 

Back-off  level 

Pm„  (WMi  |  •  •  •) 

0 

7 (M(t)i),  coord,  punc,  a(P),  7 (H),  wh ,  th ,  A side,  subcatside,  side 

1 

7 (M(t)i),  coord,  punc,  a(P ),  7 (H),  th,  A side,  subcatside,  side 

2 

tMi 

The  two  parameter  classes  for  generating  modifying  nonterminals  that  are  children 
of  base  NPs  (NPB  nodes).  Pm, hpb  and  Pm„  .npb,  have  the  following  back-off  structures. 
Back-off  level  3  of  the  Pm„  ,npb  parameters  includes  words  that  are  the  heads  of  the  ob¬ 
served  roots  of  sentences  (that  is,  the  head  word  of  the  entire  sentence).  Also,  note  that 
there  is  no  coord  flag,  as  coordinating  conjunctions  are  generated  like  regular  modify¬ 
ing  nonterminals  when  they  are  dominated  by  NPB.  Finally,  we  define  Mo  =  H,  that  is, 
the  head  nonterminal  label  of  the  base  NP  that  was  generated  using  a  Ph  parameter. 


Back-off  level 

Pm, npb  (M(t)i,  punc| . . .) 

Pm„, npb  (wm;  1  •  •  •) 

0 

P,  M(w,t)i_  1,  side 

Mi,  tMi,  punc,  P,  M(w,t)i-i,  side 

1 

P,  M(t)i- 1,  side 

Mi,  tMi,  punc,  P,  M(t)i- 1,  side 

2 

P,  1,  side 

tMi 

The  two  parameter  classes  for  generating  punctuation  and  coordinating  conjunc¬ 
tions,  Ppunc  and  Pcoordr  have  the  following  back-off  structures  (Collins,  p.c.),  where 

•  type  is  a  flag  that  obtains  the  value  p  in  the  history  contexts  of  Ppunc  parameters 
and  c  in  the  history  contexts  of  PCOOrd  parameters, 

•  M  ( w ,  t)i  is  the  modifying  preterminal  that  is  being  conjoined  to  the  head  child, 

•  tp/tc  is  the  particular  preterminal  (part  of  speech  tag)  that  is  conjoining  the 
modifier  to  the  head  child  (such  as  "CC"  or  "  : ")  and 

•  wp/wc  is  the  particular  word  that  is  conjoining  the  modifier  to  the  head  child 
(such  as  "and"  or  "  : "). 
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Back-off  level 

P zoord^c |  •  •  •) 

Ppunc(tp |  •  •  •) 

P zoordw  (  Wc  |  •  •  •) 

Ppuncw  (  Wp  |  •  •  •) 

0 

wh,  th,  P,  H,  type 

ttyPe ,  Wh,th,  P,  H,  M(w,t)i,  type 

1 

th,  P,  H,  M(t)i ,  type 

ttype,  th,  p,  H,  M(t)i ,  type 

2 

type 

ttype 

The  parameter  classes  for  generating  fully-lexicalized  root  nonterminals  given  the 
hidden  root  +TOP+,  Ptop  and  Ptop„>  have  the  following  back-off  structures  (identical 
to  the  table  in  Figure  14). 


Back-off  level 

PTopNT(H(t)  |  ...) 

PtoPv,  {w  |  •  •  •) 

0 

+  TOP  + 

t,  H,  +T0P+ 

1 

n/a 

t 

The  parameter  classes  for  generating  prior  probabilities  on  lexicalized  nonterminals 
M (w,  t ),  PPrior„  and  PPriorNT>  have  the  following  back-off  structures,  where  prior  is  a 
dummy  variable  to  indicate  that  Ppriorw  is  not  smoothed  (although  the  Ppr  ;orv  parame¬ 
ters  still  have  an  associated  smoothing  weight;  see  §5.7.2,  footnote). 


Back-off  level 

Ppriorw  ^|  •  •  •) 

PpriorNT  (  -^1  •  •  *) 

0 

prior 

w,  t 

1 

prior 

t 
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